Our goal is to predict 10-year ASCVD risk in adults using key features such as age, gender, race, smoking status, diabetes, hypertension, and cholesterol levels. The dataset aims to facilitate accurate risk assessments and guide targeted preventive healthcare interventions.
It consist of 1000 row that each have 10 attributes.
“Risk”; 10-year risk for ASCVD which is categorized as:
Low-risk (<5%)
Borderline risk (5% to 7.4%)
Intermediate risk (7.5% to 19.9%)
High risk (≥20%)
## Attribute_Name Description Data_Type
## 1 isMale Gender Binary
## 2 isBlack Race Binary
## 3 isSmoker Smoking Status Binary
## 4 isDiabetic Diabetes Status Binary
## 5 isHypertensive Hypertension Status Binary
## 6 Age Age of the candidate Numeric (Integer)
## 7 Systolic Max Blood Pressure Numeric (Integer)
## 8 Cholesterol Total Cholesterol Numeric (Integer)
## 9 HDL HDL Cholesterol Numeric (Integer)
## 10 Risk (class label) 10-year ASCVD Risk Numeric (Decimal)
## Possible_Values
## 1 0 (Female), 1 (Male)
## 2 0 (Not Black), 1 (Black)
## 3 0 (Non-smoker), 1 (Smoker)
## 4 0 (Normal), 1 (Diabetic)
## 5 0 (Normal BP), 1 (High BP)
## 6 Range between 40-79
## 7 Range between 90-200
## 8 Range between 130-200
## 9 Range between 20-100
## 10 Low, Borderline, Intermediate, High risk
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
dataset <- read.csv("heartRisk.csv")
print(dataset)
## isMale isBlack isSmoker isDiabetic isHypertensive Age Systolic Cholesterol
## 1 1 1 0 1 1 49 101 181
## 2 0 0 0 1 1 69 167 155
## 3 0 1 1 1 1 50 181 147
## 4 1 1 1 1 0 42 145 166
## 5 0 0 1 0 1 66 134 199
## 6 0 0 1 0 1 52 154 174
## 7 1 0 1 0 0 40 104 187
## 8 1 0 1 1 0 75 136 189
## 9 0 0 1 0 1 42 169 179
## 10 1 0 0 1 1 65 196 187
## 11 1 0 1 1 0 72 154 179
## 12 0 1 0 0 1 57 195 188
## 13 1 0 0 0 1 43 136 131
## 14 1 1 1 0 1 61 129 156
## 15 1 0 0 0 1 54 165 197
## 16 0 0 1 1 0 67 127 137
## 17 1 1 0 0 1 73 140 159
## 18 0 1 1 0 1 40 200 181
## 19 1 1 1 1 1 64 167 189
## 20 0 1 0 1 1 76 192 130
## 21 0 0 1 1 0 63 106 145
## 22 1 1 0 0 0 74 99 167
## 23 0 0 1 1 1 45 146 153
## 24 1 1 1 0 1 59 186 170
## 25 1 1 0 1 1 60 141 148
## 26 1 1 0 1 1 68 119 177
## 27 0 1 0 1 1 60 131 135
## 28 1 1 0 0 1 54 169 152
## 29 1 0 1 0 1 64 143 181
## 30 1 1 0 0 0 68 198 194
## 31 0 0 0 1 1 56 180 147
## 32 0 0 0 0 1 46 119 195
## 33 1 0 0 0 1 57 114 187
## 34 1 1 1 0 1 59 147 133
## 35 0 1 1 0 0 70 165 135
## 36 1 1 0 1 1 56 160 145
## 37 0 0 1 1 0 73 188 172
## 38 1 1 0 0 1 52 126 172
## 39 1 0 1 1 0 67 94 189
## 40 1 1 0 0 0 60 143 188
## 41 0 1 1 1 1 75 181 154
## 42 0 0 1 0 0 61 174 163
## 43 1 0 0 0 0 59 174 197
## 44 0 0 0 0 0 47 167 154
## 45 0 1 0 0 0 42 92 184
## 46 1 1 0 1 0 67 155 199
## 47 0 1 1 0 0 57 161 198
## 48 1 0 1 1 0 40 90 160
## 49 1 0 0 0 0 41 99 169
## 50 0 0 1 1 0 77 178 138
## 51 1 1 1 1 1 45 166 164
## 52 1 1 1 1 1 68 121 132
## 53 0 1 0 1 0 76 111 153
## 54 0 1 1 1 0 51 114 162
## 55 0 0 1 0 1 76 178 157
## 56 0 0 0 0 0 52 192 148
## 57 0 1 1 1 0 68 169 147
## 58 0 0 0 1 0 47 125 146
## 59 1 0 0 1 0 58 113 185
## 60 0 1 0 0 1 78 103 139
## 61 0 1 1 1 0 41 145 159
## 62 0 0 1 0 1 61 101 177
## 63 0 0 1 1 1 44 157 130
## 64 1 1 0 1 1 79 150 178
## 65 1 1 1 0 1 45 169 148
## 66 0 1 0 1 0 75 138 149
## 67 1 1 0 0 1 41 182 179
## 68 0 1 1 0 1 40 148 194
## 69 1 1 1 1 0 75 106 142
## 70 1 1 0 0 0 75 107 130
## 71 0 0 1 1 0 58 102 200
## 72 1 1 1 1 1 47 129 189
## 73 0 0 1 1 1 70 186 179
## 74 0 1 1 1 1 69 123 130
## 75 0 1 0 0 0 57 112 199
## 76 0 0 0 1 1 69 130 189
## 77 1 0 0 0 0 58 91 187
## 78 1 1 0 0 0 40 125 183
## 79 1 1 1 0 1 53 187 175
## 80 0 1 1 1 0 71 150 176
## 81 0 1 1 0 1 78 185 137
## 82 0 1 1 1 0 59 190 146
## 83 0 1 0 0 1 67 99 167
## 84 0 1 0 0 1 78 138 193
## 85 1 0 0 0 1 62 134 175
## 86 1 0 0 1 0 73 102 137
## 87 1 1 0 0 0 55 176 138
## 88 1 1 0 1 0 46 101 187
## 89 0 1 1 0 0 51 136 168
## 90 1 0 0 1 1 49 191 131
## 91 0 1 1 0 0 52 95 145
## 92 0 1 1 1 0 51 184 153
## 93 0 1 0 1 0 74 99 187
## 94 1 0 1 0 1 73 127 142
## 95 1 1 1 0 1 58 183 154
## 96 0 0 1 0 1 45 100 168
## 97 0 1 0 0 1 66 152 162
## 98 1 1 0 1 1 72 169 164
## 99 0 0 1 0 0 58 100 196
## 100 1 0 1 0 1 79 188 135
## 101 0 0 0 0 1 60 167 146
## 102 1 0 1 0 0 76 118 154
## 103 0 1 0 0 1 49 102 157
## 104 1 1 1 0 1 64 124 134
## 105 0 1 1 0 0 42 142 156
## 106 1 0 0 1 0 71 193 130
## 107 1 0 0 0 1 51 189 136
## 108 0 0 0 1 0 50 95 174
## 109 0 0 0 1 0 72 106 157
## 110 1 0 1 1 0 51 100 131
## 111 0 0 1 0 0 55 95 136
## 112 0 0 1 1 0 71 157 182
## 113 0 1 1 1 0 44 175 151
## 114 1 0 1 0 0 75 159 150
## 115 0 0 0 1 1 49 95 141
## 116 0 0 0 0 0 72 135 131
## 117 0 0 0 0 0 47 106 197
## 118 1 1 0 0 1 46 126 138
## 119 0 1 1 1 0 75 196 185
## 120 0 0 1 1 1 69 186 154
## 121 1 1 1 0 0 70 94 187
## 122 1 1 0 1 0 42 192 169
## 123 0 1 1 0 0 70 164 177
## 124 1 1 0 0 1 49 112 147
## 125 1 0 0 1 0 64 108 149
## 126 1 1 1 0 0 44 178 161
## 127 0 1 0 0 0 51 188 158
## 128 1 1 0 0 0 49 119 154
## 129 0 0 1 1 1 57 110 173
## 130 0 1 1 1 1 69 104 157
## 131 0 0 0 1 0 56 114 177
## 132 0 0 0 0 1 55 163 148
## 133 0 1 1 0 0 76 102 191
## 134 1 1 1 1 1 73 128 131
## 135 0 1 0 0 1 57 161 193
## 136 0 1 0 1 0 63 170 194
## 137 1 0 1 1 0 54 184 138
## 138 0 0 1 0 1 40 173 155
## 139 1 1 0 0 1 47 94 153
## 140 0 0 1 1 0 51 161 163
## 141 0 1 0 1 0 48 187 170
## 142 1 0 0 1 1 49 144 150
## 143 0 0 1 1 0 76 117 136
## 144 1 1 1 1 1 69 104 193
## 145 1 1 0 1 1 53 151 170
## 146 1 0 0 0 0 63 121 178
## 147 1 0 1 1 1 50 191 155
## 148 1 1 0 1 0 76 193 190
## 149 0 0 0 0 0 67 143 197
## 150 1 0 1 0 0 42 119 171
## 151 1 0 1 1 1 70 118 137
## 152 0 1 1 1 1 48 189 173
## 153 1 1 1 0 0 46 109 173
## 154 1 1 0 0 0 60 115 178
## 155 1 1 1 0 0 46 116 180
## 156 1 0 1 1 1 62 144 136
## 157 1 1 0 1 0 67 168 132
## 158 0 1 0 0 1 50 144 186
## 159 0 0 0 1 1 70 168 165
## 160 0 0 1 0 0 50 154 131
## 161 1 1 0 0 1 71 161 179
## 162 1 1 1 1 0 50 189 191
## 163 0 0 1 0 1 67 197 192
## 164 0 1 1 1 1 54 122 135
## 165 1 1 1 0 0 47 190 193
## 166 1 0 1 1 1 40 151 170
## 167 0 0 0 1 1 43 125 174
## 168 1 0 0 0 1 44 169 183
## 169 0 0 1 1 0 43 146 171
## 170 0 1 1 1 0 74 124 164
## 171 1 1 0 1 1 77 131 198
## 172 0 0 1 1 0 61 92 164
## 173 0 1 0 1 0 49 187 194
## 174 1 0 1 0 1 69 164 173
## 175 0 0 1 0 1 64 178 155
## 176 1 0 1 1 0 40 104 142
## 177 1 1 1 0 0 43 129 196
## 178 1 1 0 1 1 62 160 166
## 179 0 1 0 0 1 62 117 133
## 180 1 0 0 1 0 48 151 142
## 181 1 1 0 0 0 71 105 192
## 182 0 0 0 0 1 46 96 136
## 183 1 1 0 1 0 74 142 195
## 184 0 1 0 0 0 51 149 155
## 185 0 1 1 0 0 60 121 172
## 186 1 1 0 0 1 72 158 183
## 187 0 1 0 1 1 44 178 193
## 188 0 0 0 0 1 74 195 140
## 189 0 0 1 0 1 43 146 178
## 190 0 0 1 1 1 61 182 140
## 191 0 0 0 1 1 75 128 152
## 192 1 1 0 0 0 62 192 135
## 193 1 1 0 1 0 45 145 154
## 194 1 0 0 1 0 71 174 176
## 195 0 1 0 0 0 58 160 144
## 196 1 1 1 0 0 77 123 182
## 197 0 0 1 0 1 56 193 138
## 198 1 1 0 1 1 56 129 154
## 199 1 0 0 1 1 74 175 146
## 200 1 0 0 0 0 41 91 156
## 201 1 1 0 1 1 77 197 130
## 202 1 1 0 1 0 69 140 150
## 203 0 1 0 0 1 72 125 164
## 204 1 1 0 0 1 61 120 173
## 205 1 0 1 1 1 79 128 160
## 206 1 1 1 1 1 57 120 141
## 207 0 1 0 1 0 66 106 187
## 208 1 1 1 0 1 69 161 175
## 209 0 1 1 1 0 70 132 149
## 210 1 0 0 0 0 61 170 164
## 211 1 0 1 1 0 58 127 184
## 212 0 0 1 0 1 67 186 154
## 213 1 0 1 1 1 73 187 167
## 214 1 1 1 1 0 40 93 161
## 215 1 1 1 0 0 68 155 131
## 216 1 1 0 0 0 41 126 184
## 217 0 0 0 0 1 71 196 186
## 218 1 1 1 1 1 56 174 187
## 219 0 0 0 1 0 57 179 185
## 220 1 1 1 0 1 46 200 195
## 221 0 1 0 1 1 78 157 188
## 222 0 0 1 1 0 61 189 170
## 223 0 1 1 0 1 46 90 170
## 224 1 0 1 0 1 69 124 131
## 225 0 0 1 1 1 74 181 176
## 226 0 1 1 0 0 68 118 196
## 227 0 0 0 1 0 65 175 135
## 228 1 0 0 1 1 68 154 176
## 229 1 1 1 1 1 48 150 167
## 230 1 1 1 0 0 76 141 176
## 231 0 0 0 0 0 76 181 169
## 232 0 0 0 1 1 67 135 197
## 233 0 0 0 0 1 69 148 169
## 234 1 1 0 1 1 69 188 182
## 235 0 0 0 1 0 47 182 193
## 236 0 0 1 0 1 56 133 145
## 237 1 1 1 1 0 78 152 146
## 238 1 1 0 1 0 52 178 149
## 239 1 0 1 0 1 49 150 186
## 240 1 0 1 1 1 75 137 176
## 241 1 0 1 1 1 40 152 130
## 242 0 1 0 1 0 69 103 198
## 243 0 1 1 0 1 61 138 136
## 244 0 1 0 0 0 56 90 182
## 245 0 0 0 0 0 61 190 147
## 246 0 1 0 0 0 66 106 187
## 247 1 1 0 0 0 41 142 190
## 248 1 1 1 1 1 65 117 160
## 249 1 1 0 1 0 53 94 199
## 250 1 0 0 0 1 43 122 135
## 251 1 1 0 0 1 65 105 184
## 252 0 0 0 1 0 59 181 178
## 253 0 1 1 0 0 54 116 176
## 254 0 0 1 0 1 55 188 146
## 255 0 1 0 1 1 64 123 184
## 256 1 0 1 0 1 56 132 181
## 257 1 0 1 0 0 50 108 145
## 258 1 1 1 0 1 70 129 143
## 259 0 1 0 0 0 73 94 190
## 260 1 1 1 1 0 46 110 198
## 261 0 0 0 0 0 54 94 132
## 262 1 0 1 0 0 63 189 148
## 263 0 1 0 0 1 53 140 164
## 264 0 0 0 1 1 48 138 181
## 265 0 0 0 1 1 54 181 142
## 266 0 0 0 0 0 42 143 160
## 267 1 0 1 0 1 60 130 133
## 268 1 0 0 0 0 74 95 169
## 269 1 0 0 1 1 69 193 172
## 270 0 1 1 0 1 61 105 139
## 271 1 0 1 1 1 76 137 146
## 272 0 1 0 1 1 58 198 139
## 273 0 1 0 0 1 70 125 135
## 274 1 0 0 1 0 50 181 161
## 275 1 1 0 0 1 68 99 166
## 276 0 1 1 0 0 63 98 173
## 277 0 0 1 1 1 72 118 188
## 278 0 1 1 0 1 73 156 137
## 279 1 1 1 0 1 70 120 176
## 280 0 1 1 0 0 45 170 158
## 281 0 1 1 0 0 70 183 133
## 282 1 1 0 0 1 58 156 178
## 283 0 0 1 1 1 63 124 139
## 284 1 1 0 1 0 73 183 187
## 285 0 0 0 1 0 57 105 141
## 286 0 0 1 1 0 49 189 174
## 287 0 0 1 0 0 43 122 159
## 288 0 1 0 0 0 74 106 185
## 289 1 1 0 0 0 75 137 136
## 290 1 0 0 0 1 73 130 161
## 291 0 1 1 1 0 60 145 167
## 292 1 1 0 0 0 43 191 180
## 293 0 1 1 0 1 47 200 190
## 294 1 0 1 1 1 66 177 177
## 295 1 1 1 0 1 66 117 139
## 296 0 0 1 1 1 49 99 186
## 297 0 1 0 0 0 71 194 156
## 298 0 1 0 1 1 60 126 172
## 299 0 0 0 1 1 68 105 167
## 300 0 0 0 0 1 60 194 151
## 301 0 0 1 1 0 54 154 144
## 302 1 1 1 0 1 60 159 155
## 303 1 0 0 1 1 59 166 172
## 304 0 0 1 1 0 48 98 155
## 305 0 1 0 0 0 54 188 150
## 306 1 0 0 1 1 63 199 158
## 307 0 0 1 1 1 66 127 168
## 308 0 0 1 0 0 46 158 130
## 309 1 0 1 1 1 73 161 165
## 310 1 0 1 1 0 57 113 191
## 311 1 0 1 0 0 41 90 160
## 312 1 0 1 1 1 76 163 136
## 313 1 1 0 0 1 69 191 136
## 314 0 1 1 1 1 72 192 143
## 315 1 1 1 1 0 50 157 141
## 316 1 0 1 0 0 77 105 169
## 317 0 0 1 1 1 54 155 134
## 318 1 1 1 0 1 52 112 139
## 319 1 1 1 0 0 42 165 177
## 320 0 0 0 1 1 62 154 198
## 321 0 0 1 1 0 64 158 170
## 322 1 0 0 1 1 79 152 155
## 323 1 0 1 1 0 65 191 146
## 324 0 1 0 1 1 56 178 138
## 325 1 0 0 0 1 78 126 182
## 326 1 0 1 0 1 62 107 188
## 327 0 0 1 0 0 49 165 135
## 328 0 1 1 1 1 69 135 190
## 329 1 0 0 0 1 71 122 141
## 330 1 0 0 1 0 70 144 174
## 331 0 1 1 0 0 64 139 143
## 332 0 0 0 1 0 77 160 171
## 333 1 1 0 1 0 64 200 187
## 334 0 1 1 0 0 52 103 163
## 335 1 0 0 0 0 61 196 159
## 336 1 1 0 1 1 40 92 158
## 337 1 0 1 0 1 42 117 149
## 338 0 0 1 1 1 48 114 181
## 339 0 1 0 0 0 66 154 193
## 340 0 1 0 1 1 53 98 183
## 341 1 0 0 1 0 43 125 163
## 342 1 1 0 1 0 44 153 163
## 343 0 1 0 0 0 62 159 162
## 344 0 1 0 0 1 44 123 170
## 345 1 1 1 0 1 79 100 188
## 346 1 1 1 1 1 55 103 161
## 347 0 1 0 0 0 48 165 145
## 348 0 1 0 1 1 48 119 181
## 349 0 0 0 1 0 65 156 192
## 350 1 0 1 1 1 48 96 153
## 351 1 0 1 1 0 44 185 187
## 352 0 1 0 1 1 76 163 136
## 353 1 0 0 1 1 49 107 150
## 354 0 1 0 1 1 65 184 146
## 355 0 0 1 1 0 51 126 133
## 356 1 0 1 1 1 41 93 154
## 357 0 1 0 1 0 62 110 193
## 358 0 0 0 0 1 48 135 147
## 359 0 0 0 1 1 45 118 193
## 360 0 1 0 1 1 71 169 189
## 361 1 1 0 1 1 60 124 172
## 362 1 1 0 1 1 48 154 158
## 363 1 0 0 0 1 69 97 199
## 364 1 1 0 1 0 60 157 174
## 365 1 1 1 0 0 53 184 183
## 366 1 1 1 0 0 41 170 157
## 367 1 0 0 1 0 42 118 150
## 368 0 1 1 0 1 64 100 184
## 369 0 1 0 1 1 61 149 167
## 370 0 0 1 1 0 76 185 196
## 371 1 1 0 0 0 49 181 135
## 372 0 0 0 1 1 63 95 149
## 373 1 0 1 1 1 48 171 180
## 374 0 0 0 0 1 50 168 166
## 375 1 1 0 1 0 55 108 156
## 376 0 0 1 0 0 78 128 165
## 377 1 0 1 0 1 68 106 184
## 378 0 1 0 0 0 59 191 149
## 379 1 1 1 1 0 55 147 186
## 380 1 1 0 0 0 64 200 175
## 381 1 1 0 0 1 54 140 156
## 382 1 1 0 0 1 75 149 149
## 383 1 0 0 0 0 64 118 166
## 384 0 0 1 1 0 70 196 182
## 385 1 0 1 1 0 68 162 150
## 386 1 1 0 0 1 71 154 149
## 387 0 1 0 1 1 73 106 185
## 388 1 0 0 1 0 59 192 180
## 389 1 1 1 1 1 47 197 150
## 390 1 1 1 1 0 57 139 155
## 391 0 0 1 1 1 55 176 132
## 392 0 1 0 1 0 79 94 192
## 393 0 0 1 1 1 46 112 137
## 394 0 0 1 0 0 45 162 160
## 395 0 0 1 0 1 76 180 189
## 396 1 1 1 0 0 57 160 146
## 397 1 0 0 1 0 49 167 161
## 398 1 0 1 1 1 46 155 182
## 399 1 1 0 1 0 57 97 180
## 400 0 1 1 1 0 77 95 184
## 401 1 0 1 1 1 44 130 190
## 402 1 1 0 0 0 72 178 135
## 403 0 1 1 0 1 46 188 173
## 404 0 0 1 1 1 54 161 190
## 405 1 0 0 0 0 55 187 135
## 406 1 0 0 1 1 42 149 190
## 407 1 0 1 1 0 63 182 181
## 408 1 0 1 0 1 48 133 174
## 409 1 1 1 0 1 77 188 176
## 410 0 1 1 0 0 77 155 137
## 411 0 1 0 1 0 72 166 154
## 412 1 1 0 0 0 60 90 187
## 413 0 0 0 1 1 43 159 156
## 414 0 0 0 0 0 49 112 147
## 415 1 0 1 1 1 50 118 177
## 416 1 1 1 1 0 55 189 154
## 417 1 0 1 0 0 44 138 192
## 418 1 1 1 1 0 46 199 151
## 419 0 1 1 0 1 72 157 194
## 420 1 0 1 1 1 44 175 148
## 421 1 0 0 0 0 60 116 148
## 422 0 0 0 1 1 66 121 197
## 423 0 0 1 0 1 44 180 199
## 424 0 1 0 0 1 48 153 191
## 425 1 0 0 1 1 51 102 164
## 426 0 0 1 0 1 72 169 179
## 427 1 0 0 0 0 67 189 161
## 428 0 1 0 1 0 66 123 182
## 429 0 1 0 1 1 53 133 181
## 430 0 0 1 1 0 46 129 176
## 431 1 1 1 0 0 71 167 176
## 432 0 0 0 1 0 61 135 151
## 433 1 1 1 0 1 67 174 172
## 434 1 1 0 1 1 43 141 143
## 435 0 0 1 1 1 69 106 183
## 436 0 1 0 0 1 75 134 137
## 437 0 1 0 0 1 44 193 200
## 438 0 1 1 0 0 49 180 195
## 439 1 0 1 0 1 51 170 144
## 440 0 0 1 1 0 63 105 139
## 441 1 0 1 0 1 58 181 163
## 442 1 0 0 1 0 51 109 174
## 443 1 0 1 1 1 63 124 135
## 444 1 1 0 0 0 79 107 195
## 445 0 0 1 1 0 71 200 141
## 446 0 1 1 1 1 71 157 146
## 447 0 1 0 0 1 50 91 155
## 448 1 0 0 1 0 76 179 166
## 449 1 1 1 0 0 61 130 192
## 450 0 0 1 1 1 59 122 171
## 451 1 0 0 1 0 57 96 135
## 452 0 0 1 1 1 42 153 148
## 453 1 1 0 1 0 46 200 173
## 454 1 1 1 1 0 57 136 168
## 455 0 0 0 1 1 70 104 146
## 456 1 0 1 1 0 65 140 188
## 457 0 1 0 0 1 49 115 186
## 458 0 1 1 0 1 53 94 135
## 459 1 1 1 1 0 42 186 149
## 460 1 1 0 0 1 74 126 169
## 461 0 1 1 1 0 65 150 134
## 462 0 0 1 1 0 66 119 136
## 463 0 1 1 0 1 48 167 147
## 464 1 1 0 0 0 51 119 177
## 465 1 0 1 1 1 47 145 186
## 466 0 0 1 1 1 54 91 180
## 467 0 1 1 1 0 66 156 130
## 468 0 0 1 0 1 71 137 156
## 469 0 0 1 0 0 73 175 137
## 470 0 0 1 1 0 56 166 174
## 471 1 1 1 1 0 65 118 190
## 472 0 1 1 0 0 70 103 142
## 473 0 1 1 0 1 50 155 199
## 474 0 1 1 1 0 72 95 134
## 475 1 0 0 1 0 75 195 144
## 476 0 0 0 1 0 56 113 159
## 477 0 0 0 1 0 56 191 185
## 478 0 0 1 0 0 51 180 135
## 479 1 0 0 1 1 75 109 135
## 480 1 0 1 1 0 68 144 185
## 481 0 1 1 0 0 40 102 152
## 482 0 1 0 0 0 72 95 176
## 483 1 1 0 0 1 65 197 143
## 484 1 0 1 0 1 52 146 134
## 485 0 0 1 1 0 51 159 182
## 486 1 1 1 1 1 73 172 176
## 487 0 1 0 1 0 73 186 150
## 488 0 1 1 0 0 46 184 131
## 489 0 0 1 1 1 72 180 170
## 490 1 1 0 1 0 76 97 143
## 491 1 1 1 1 0 53 163 152
## 492 1 0 0 0 1 43 106 161
## 493 0 0 1 0 1 52 108 142
## 494 1 0 1 1 0 48 129 185
## 495 1 1 0 0 1 68 107 156
## 496 1 0 1 1 1 70 166 173
## 497 1 0 1 0 1 59 118 181
## 498 0 0 1 0 0 54 166 154
## 499 1 1 0 1 1 47 162 163
## 500 0 1 0 0 0 65 121 175
## 501 0 0 0 1 0 78 122 161
## 502 1 1 0 0 0 42 97 142
## 503 1 0 0 0 1 44 144 157
## 504 0 0 0 0 0 42 100 141
## 505 0 0 0 1 0 41 90 144
## 506 1 1 0 0 0 43 118 195
## 507 0 1 0 1 0 52 136 198
## 508 0 0 0 0 0 48 95 137
## 509 0 1 0 0 1 61 200 167
## 510 1 1 1 0 0 43 116 158
## 511 0 0 1 1 1 75 171 165
## 512 0 0 0 1 0 54 150 180
## 513 0 1 1 1 0 55 141 153
## 514 1 1 1 1 0 68 161 146
## 515 0 1 0 1 0 68 139 143
## 516 0 0 1 0 1 43 147 145
## 517 1 1 0 1 0 71 199 197
## 518 1 1 0 0 1 59 193 131
## 519 0 0 1 0 1 70 196 180
## 520 0 0 1 1 0 76 175 137
## 521 1 0 1 0 0 67 166 132
## 522 0 1 1 1 1 73 122 172
## 523 0 0 0 0 0 56 111 160
## 524 0 1 1 1 1 61 92 169
## 525 1 0 0 1 0 47 92 160
## 526 0 0 0 1 1 75 163 140
## 527 0 1 0 0 1 76 166 178
## 528 1 1 0 1 1 44 108 176
## 529 0 0 0 1 1 40 198 173
## 530 1 0 0 0 0 71 134 199
## 531 1 1 1 1 1 67 178 182
## 532 0 1 0 1 0 45 107 132
## 533 1 1 1 1 1 63 108 134
## 534 0 0 0 0 1 56 191 142
## 535 1 0 0 0 0 70 136 134
## 536 1 0 1 0 0 55 189 167
## 537 0 1 0 1 1 72 95 135
## 538 0 1 1 1 0 74 177 154
## 539 0 1 1 0 1 51 154 140
## 540 1 1 1 0 1 48 140 142
## 541 1 0 1 0 0 48 138 195
## 542 1 1 1 1 0 76 168 176
## 543 1 1 0 1 0 72 188 146
## 544 1 1 1 0 1 56 171 170
## 545 1 1 0 1 1 53 105 133
## 546 1 0 1 0 0 68 117 193
## 547 1 1 0 1 0 44 150 163
## 548 0 1 0 1 1 43 198 170
## 549 1 1 0 1 1 70 146 132
## 550 0 0 0 1 1 56 112 162
## 551 0 1 0 1 1 43 106 157
## 552 1 1 1 0 0 72 194 154
## 553 1 0 1 1 0 66 179 188
## 554 0 0 0 0 1 74 166 137
## 555 1 1 0 1 1 72 193 130
## 556 1 1 1 1 1 78 114 155
## 557 0 0 1 1 1 77 193 170
## 558 0 1 1 1 0 49 109 190
## 559 0 1 1 0 1 74 169 194
## 560 1 1 0 0 1 69 192 142
## 561 0 1 1 1 0 78 177 137
## 562 0 0 1 0 0 67 124 188
## 563 0 1 0 1 1 51 104 187
## 564 1 1 1 1 1 43 181 138
## 565 1 0 0 0 1 60 138 192
## 566 1 1 1 1 1 61 139 133
## 567 0 1 1 1 0 68 161 147
## 568 1 1 0 1 1 66 154 139
## 569 0 1 0 1 0 72 170 149
## 570 1 0 1 1 0 58 187 186
## 571 0 1 1 1 0 73 154 153
## 572 0 0 0 0 0 73 136 169
## 573 0 1 0 0 1 67 114 146
## 574 0 0 0 1 0 59 125 193
## 575 1 1 0 0 1 78 183 178
## 576 0 0 0 0 0 63 139 197
## 577 1 0 1 0 0 51 156 182
## 578 1 0 1 1 1 79 172 193
## 579 0 0 0 0 1 43 173 141
## 580 0 0 1 1 1 50 196 196
## 581 1 0 1 0 1 61 126 193
## 582 0 1 0 1 1 62 148 166
## 583 0 1 1 1 1 63 153 132
## 584 0 1 0 1 1 44 115 177
## 585 1 1 0 1 1 64 92 134
## 586 1 1 0 0 0 40 142 163
## 587 1 1 0 1 0 56 132 167
## 588 0 1 1 0 1 60 105 188
## 589 0 1 0 1 1 69 158 146
## 590 0 1 0 1 0 63 182 172
## 591 0 1 1 1 0 69 169 153
## 592 0 0 1 1 1 45 167 167
## 593 0 1 0 1 0 47 96 167
## 594 0 1 0 0 1 72 179 164
## 595 1 0 0 0 1 42 143 157
## 596 0 1 1 1 1 73 124 199
## 597 1 1 1 1 1 67 123 154
## 598 0 1 0 0 0 43 196 171
## 599 0 1 0 1 1 69 101 175
## 600 0 1 1 0 0 49 130 130
## 601 0 0 1 0 0 46 130 135
## 602 1 1 1 0 0 71 144 178
## 603 0 0 1 0 1 74 183 164
## 604 1 1 1 0 0 74 116 184
## 605 1 1 0 1 0 61 152 172
## 606 0 0 1 1 1 42 184 148
## 607 0 0 1 1 0 66 163 147
## 608 0 0 1 1 1 50 149 182
## 609 1 1 1 1 1 49 197 158
## 610 1 1 0 1 1 49 106 196
## 611 0 0 1 0 0 47 196 171
## 612 0 1 0 1 1 68 134 137
## 613 0 0 1 0 0 74 124 196
## 614 0 1 1 1 1 54 145 140
## 615 1 1 1 0 0 61 192 135
## 616 1 0 0 0 1 70 162 163
## 617 0 1 1 1 1 40 190 164
## 618 1 0 0 0 0 66 125 157
## 619 0 0 1 1 1 55 178 149
## 620 0 1 1 0 0 70 185 149
## 621 0 1 1 1 1 44 117 169
## 622 1 1 0 1 0 69 130 183
## 623 0 1 1 1 1 64 115 182
## 624 0 0 0 1 0 56 166 188
## 625 1 1 0 1 0 50 153 147
## 626 0 1 1 0 0 49 166 188
## 627 1 0 1 1 0 62 177 144
## 628 1 1 0 0 1 42 158 152
## 629 0 1 0 1 1 42 174 178
## 630 0 0 0 0 1 69 175 189
## 631 1 0 1 1 1 79 128 195
## 632 1 0 0 0 1 40 138 172
## 633 0 0 0 1 1 59 160 155
## 634 0 0 1 0 0 49 186 137
## 635 0 1 0 0 0 46 139 172
## 636 0 0 1 0 0 72 115 131
## 637 1 0 0 1 0 67 174 189
## 638 0 1 0 0 0 62 140 157
## 639 1 1 0 0 0 69 119 181
## 640 1 0 1 1 0 47 165 160
## 641 0 0 1 0 1 74 118 187
## 642 0 0 1 0 0 53 132 148
## 643 1 1 0 0 0 61 164 153
## 644 0 1 0 0 0 45 139 132
## 645 0 0 0 0 1 65 183 148
## 646 1 1 0 0 1 54 100 161
## 647 0 0 1 0 0 40 141 173
## 648 0 0 0 0 0 46 168 144
## 649 1 0 1 1 0 52 126 133
## 650 1 0 0 1 0 57 145 143
## 651 0 0 0 0 0 48 151 181
## 652 1 0 0 1 1 61 130 185
## 653 1 1 1 1 0 74 197 147
## 654 0 1 1 1 1 56 130 138
## 655 0 0 0 0 1 42 190 157
## 656 1 1 1 1 1 61 178 162
## 657 0 0 1 0 1 66 169 138
## 658 1 1 1 1 1 43 175 153
## 659 1 1 0 1 0 58 100 189
## 660 0 1 1 0 1 79 127 155
## 661 1 1 0 0 0 77 118 152
## 662 0 1 1 0 0 61 139 181
## 663 0 1 0 0 0 71 185 161
## 664 1 1 1 0 1 70 113 147
## 665 1 1 1 0 0 67 184 163
## 666 1 1 1 0 1 62 102 151
## 667 1 0 1 0 0 48 108 145
## 668 1 1 1 1 1 52 146 130
## 669 1 1 1 0 0 78 91 200
## 670 1 1 1 1 1 58 122 143
## 671 1 1 1 1 1 75 107 172
## 672 0 1 0 0 1 77 133 173
## 673 0 1 0 1 1 73 159 190
## 674 1 0 0 1 0 79 116 149
## 675 0 0 0 1 1 76 146 168
## 676 1 0 0 0 0 72 104 172
## 677 0 0 1 0 1 67 135 182
## 678 0 1 1 0 0 44 123 196
## 679 1 1 0 0 1 60 157 171
## 680 1 0 1 1 0 43 169 150
## 681 1 1 1 0 1 42 165 186
## 682 0 1 0 0 1 70 111 134
## 683 1 1 1 0 0 49 136 164
## 684 1 0 0 1 0 70 174 151
## 685 1 1 0 0 0 48 157 181
## 686 0 1 1 1 0 46 109 158
## 687 1 0 1 1 0 60 150 142
## 688 1 1 0 1 1 63 173 133
## 689 1 0 1 1 0 54 160 195
## 690 0 1 1 0 1 71 139 138
## 691 0 0 0 0 1 67 192 132
## 692 1 0 1 0 0 69 161 143
## 693 0 1 1 0 0 65 103 169
## 694 1 1 0 0 1 55 147 156
## 695 0 1 1 0 0 68 168 184
## 696 0 1 1 1 1 49 112 132
## 697 0 1 0 0 1 58 143 168
## 698 1 0 1 1 1 66 110 149
## 699 1 0 0 0 0 75 179 131
## 700 0 1 1 0 0 71 162 144
## 701 0 0 0 0 0 56 161 198
## 702 0 1 1 1 1 43 153 135
## 703 1 0 0 1 0 75 117 199
## 704 0 0 0 0 0 58 190 163
## 705 0 1 1 1 1 79 158 176
## 706 0 0 0 0 0 51 165 174
## 707 0 1 0 1 1 65 144 191
## 708 0 1 0 1 0 42 146 188
## 709 1 1 0 1 1 45 180 165
## 710 1 0 0 0 1 41 195 159
## 711 1 1 1 1 0 57 90 157
## 712 1 1 1 0 0 52 129 147
## 713 1 0 0 0 0 46 192 193
## 714 0 0 1 1 0 51 181 158
## 715 0 0 0 0 0 49 191 157
## 716 1 0 1 1 0 59 147 137
## 717 1 0 1 1 1 56 156 193
## 718 0 0 0 0 0 50 108 153
## 719 0 0 0 0 1 57 172 175
## 720 1 1 1 0 0 67 136 149
## 721 0 1 1 0 0 45 130 145
## 722 1 1 1 1 0 59 137 170
## 723 0 1 0 0 1 53 135 134
## 724 0 1 0 0 1 42 184 194
## 725 1 0 1 1 1 76 147 169
## 726 1 0 1 1 1 78 166 188
## 727 0 0 1 1 1 61 190 192
## 728 1 1 1 0 0 44 161 135
## 729 1 1 1 0 0 64 113 143
## 730 0 0 1 0 0 77 150 145
## 731 1 1 0 0 1 57 103 181
## 732 1 1 0 0 0 44 123 183
## 733 0 1 0 1 1 64 144 139
## 734 0 0 1 1 0 48 194 169
## 735 1 1 1 0 1 46 169 171
## 736 0 1 1 0 0 59 102 200
## 737 0 0 1 0 1 54 123 198
## 738 0 1 0 0 1 68 142 162
## 739 1 0 0 0 1 59 154 132
## 740 0 1 0 0 0 79 97 147
## 741 0 1 1 1 1 71 199 134
## 742 0 0 1 1 0 43 90 156
## 743 1 1 1 0 1 52 134 143
## 744 1 0 0 0 0 40 121 186
## 745 1 1 1 1 1 40 139 166
## 746 1 0 0 0 0 40 107 194
## 747 0 0 1 0 0 63 141 169
## 748 0 1 0 0 1 77 112 186
## 749 1 0 0 1 1 73 90 134
## 750 0 1 1 0 0 56 164 200
## 751 1 1 0 1 0 72 123 163
## 752 0 0 1 1 1 63 191 170
## 753 0 1 0 1 0 78 90 176
## 754 1 1 0 1 1 54 93 133
## 755 0 1 1 0 0 55 116 169
## 756 1 0 1 0 0 69 178 180
## 757 0 0 1 0 1 76 140 165
## 758 0 1 1 0 0 70 179 153
## 759 1 0 1 0 0 78 175 168
## 760 0 1 1 1 0 68 108 159
## 761 0 1 1 1 1 66 166 154
## 762 0 0 1 1 0 54 102 154
## 763 1 1 0 0 1 67 156 133
## 764 0 0 0 1 0 53 119 182
## 765 0 0 1 1 0 50 200 166
## 766 0 0 1 1 1 77 115 181
## 767 1 1 0 1 0 78 101 168
## 768 1 1 1 0 1 59 108 168
## 769 1 1 0 0 0 79 146 163
## 770 0 0 0 0 0 78 154 166
## 771 0 0 0 0 0 65 144 194
## 772 0 0 1 0 1 79 128 130
## 773 0 1 0 0 0 63 161 135
## 774 0 1 1 1 0 65 121 146
## 775 1 1 0 0 0 47 142 153
## 776 1 1 1 0 1 49 118 192
## 777 0 1 0 1 1 50 153 187
## 778 0 0 0 1 0 40 198 190
## 779 0 1 1 1 0 61 120 184
## 780 1 0 0 1 1 42 98 166
## 781 0 1 0 1 1 64 103 169
## 782 1 1 1 1 0 47 112 189
## 783 1 0 0 0 0 69 113 197
## 784 1 1 0 1 0 43 169 191
## 785 1 0 0 0 0 77 118 192
## 786 0 0 1 0 0 58 163 187
## 787 0 0 0 1 0 72 121 147
## 788 0 1 1 0 0 47 129 154
## 789 0 0 1 1 1 76 94 191
## 790 1 0 1 1 1 70 99 150
## 791 1 1 1 1 0 56 136 189
## 792 1 1 0 1 1 46 164 160
## 793 0 1 1 1 0 52 159 180
## 794 0 1 0 1 0 65 111 172
## 795 0 0 0 0 0 66 96 190
## 796 0 0 0 1 0 42 100 144
## 797 1 0 1 0 0 43 188 175
## 798 1 0 1 1 0 71 155 181
## 799 1 1 0 1 0 60 153 162
## 800 1 1 1 1 1 43 133 158
## 801 0 0 1 1 0 79 159 194
## 802 1 1 0 1 1 40 124 188
## 803 1 1 1 1 1 40 171 181
## 804 0 1 0 0 0 46 116 141
## 805 0 1 0 0 0 40 146 137
## 806 1 0 1 1 1 50 176 136
## 807 0 0 1 1 1 75 107 190
## 808 1 1 0 1 0 62 112 177
## 809 0 1 1 1 0 46 93 179
## 810 0 1 0 0 0 51 200 144
## 811 1 1 0 1 0 70 119 150
## 812 0 1 0 1 1 50 123 136
## 813 0 0 1 0 0 75 188 137
## 814 1 0 0 0 1 60 147 161
## 815 0 1 1 1 0 52 145 136
## 816 1 0 1 0 0 75 184 194
## 817 0 0 0 1 1 69 178 164
## 818 0 1 1 1 1 44 146 170
## 819 1 0 1 0 0 76 96 177
## 820 1 0 0 1 0 51 91 198
## 821 0 1 1 0 1 44 140 177
## 822 0 1 1 1 1 58 156 194
## 823 0 1 0 1 1 78 130 164
## 824 1 1 1 1 0 59 122 159
## 825 1 1 1 1 0 43 136 169
## 826 0 1 0 1 0 69 153 180
## 827 0 1 0 1 1 48 102 167
## 828 0 0 0 1 1 63 182 159
## 829 0 0 1 0 0 47 198 145
## 830 0 0 1 1 0 50 131 179
## 831 0 0 0 0 0 58 100 182
## 832 0 0 0 1 1 54 90 183
## 833 1 0 1 1 0 42 154 150
## 834 0 1 0 0 0 57 112 146
## 835 0 0 0 1 0 40 171 140
## 836 0 0 1 0 1 49 123 158
## 837 1 1 1 0 1 40 118 189
## 838 1 0 1 1 0 40 117 151
## 839 0 0 0 1 0 48 91 146
## 840 0 1 1 1 0 54 112 151
## 841 0 1 1 0 0 51 107 184
## 842 1 0 0 1 1 51 164 180
## 843 0 1 1 0 0 49 198 198
## 844 1 1 1 1 1 63 191 165
## 845 1 0 0 1 0 73 101 170
## 846 0 1 1 1 1 60 94 195
## 847 1 0 0 0 1 75 182 179
## 848 1 0 0 1 1 55 117 160
## 849 0 0 1 1 1 77 137 194
## 850 1 1 0 0 1 79 115 182
## 851 1 1 1 1 0 67 200 170
## 852 1 0 0 1 1 59 123 183
## 853 0 0 0 1 0 63 127 190
## 854 1 1 0 0 1 62 110 195
## 855 0 0 0 0 0 52 150 184
## 856 1 0 1 0 0 56 161 180
## 857 0 1 0 0 0 48 186 163
## 858 1 1 1 0 0 55 129 147
## 859 0 0 0 1 0 52 137 154
## 860 0 0 1 0 1 68 163 178
## 861 0 1 0 0 0 57 159 169
## 862 1 1 0 0 0 54 198 166
## 863 1 0 1 0 0 55 170 194
## 864 0 1 1 1 1 58 111 200
## 865 1 0 0 0 0 71 144 199
## 866 1 0 0 0 0 62 107 170
## 867 0 1 0 1 1 75 120 174
## 868 0 1 1 1 0 50 197 163
## 869 0 0 1 1 1 63 183 152
## 870 0 0 1 0 0 56 170 172
## 871 1 1 0 1 1 64 133 176
## 872 1 1 1 1 0 60 119 147
## 873 0 0 1 1 0 61 116 181
## 874 1 1 1 0 1 45 154 139
## 875 1 0 0 1 0 42 114 172
## 876 0 1 0 0 0 71 188 171
## 877 0 1 1 1 0 65 195 149
## 878 0 0 1 1 1 40 186 144
## 879 0 1 1 0 1 46 129 153
## 880 0 0 0 1 0 50 180 131
## 881 1 1 1 1 1 55 141 158
## 882 0 0 0 0 1 52 192 170
## 883 0 1 0 0 1 57 99 182
## 884 1 0 1 0 1 61 170 185
## 885 0 1 0 0 0 76 157 136
## 886 1 0 0 1 1 59 189 141
## 887 0 0 0 0 1 48 96 134
## 888 0 0 1 0 0 41 108 183
## 889 1 1 1 0 1 56 125 161
## 890 1 1 0 1 0 61 171 144
## 891 1 1 1 0 1 47 138 197
## 892 1 1 1 0 1 51 161 194
## 893 1 1 0 0 1 48 152 186
## 894 0 0 0 0 1 52 128 146
## 895 1 1 1 0 1 40 144 185
## 896 1 0 1 1 0 52 101 137
## 897 0 0 1 0 0 79 103 133
## 898 0 1 1 1 0 61 119 188
## 899 1 1 0 0 0 53 181 149
## 900 0 0 1 1 0 42 114 144
## 901 0 0 1 1 1 54 180 137
## 902 1 0 0 1 1 65 110 134
## 903 0 1 0 1 1 61 160 199
## 904 1 1 1 0 1 47 176 156
## 905 1 1 0 0 0 57 134 141
## 906 0 1 0 0 0 46 149 161
## 907 1 0 0 1 1 78 153 134
## 908 1 1 1 0 0 52 90 174
## 909 0 0 1 1 1 50 94 192
## 910 1 0 0 1 0 79 196 170
## 911 0 1 1 0 1 57 96 133
## 912 1 1 1 0 1 40 117 157
## 913 1 0 1 1 1 78 121 184
## 914 1 1 0 1 0 45 153 199
## 915 1 1 0 0 1 52 142 151
## 916 1 0 1 0 1 42 125 185
## 917 0 1 1 0 1 54 141 163
## 918 1 0 1 1 0 75 122 170
## 919 1 0 0 0 0 62 175 141
## 920 1 0 1 0 1 58 159 173
## 921 0 1 0 1 1 44 165 148
## 922 0 0 1 1 1 68 125 167
## 923 0 0 1 1 0 45 111 140
## 924 0 0 1 0 0 46 115 163
## 925 0 0 0 0 1 51 103 168
## 926 1 1 1 0 1 67 185 193
## 927 1 0 1 0 1 41 115 155
## 928 1 1 0 0 0 77 97 179
## 929 1 0 0 0 1 45 112 194
## 930 1 0 1 1 0 47 109 199
## 931 0 1 1 1 0 57 107 158
## 932 0 0 1 1 1 54 192 181
## 933 0 0 0 1 0 66 117 192
## 934 1 0 0 1 1 65 116 151
## 935 0 1 0 1 0 61 107 180
## 936 1 1 1 0 0 65 129 135
## 937 0 1 1 0 0 74 158 197
## 938 0 0 1 1 1 71 187 184
## 939 0 0 1 0 0 77 141 133
## 940 0 1 0 1 1 77 107 155
## 941 1 1 1 1 1 67 102 159
## 942 0 0 1 1 0 52 128 144
## 943 0 0 0 1 0 58 156 141
## 944 1 0 0 1 1 56 136 143
## 945 0 0 0 0 1 72 196 150
## 946 1 0 0 1 0 43 189 144
## 947 0 0 0 0 1 56 118 185
## 948 1 0 0 0 1 48 185 131
## 949 1 1 1 1 1 79 104 169
## 950 0 1 0 1 1 57 138 139
## 951 0 1 0 0 1 51 96 162
## 952 0 0 0 1 1 78 189 189
## 953 1 1 0 1 1 67 134 164
## 954 0 1 0 1 1 69 133 164
## 955 1 1 1 0 0 55 184 179
## 956 1 0 1 1 0 42 167 153
## 957 1 1 1 1 1 65 137 133
## 958 1 1 1 0 0 59 126 169
## 959 0 1 1 0 1 40 171 158
## 960 0 0 1 1 1 46 156 192
## 961 0 1 1 0 0 46 177 138
## 962 1 0 0 0 0 52 116 144
## 963 1 0 1 1 0 71 188 191
## 964 1 1 1 1 0 46 103 133
## 965 1 0 1 1 0 40 152 166
## 966 1 0 1 0 1 54 162 166
## 967 1 1 1 1 1 52 190 162
## 968 1 1 1 1 1 64 145 197
## 969 0 0 1 1 0 43 169 177
## 970 1 1 0 0 0 73 173 200
## 971 1 1 1 0 0 66 162 146
## 972 1 1 1 0 1 42 181 155
## 973 1 1 1 1 1 67 178 156
## 974 1 1 0 0 1 65 135 136
## 975 0 0 1 0 0 58 105 135
## 976 0 0 0 1 1 48 196 177
## 977 0 1 1 1 1 68 162 137
## 978 0 1 0 1 1 45 184 146
## 979 1 1 1 0 1 77 160 142
## 980 1 1 0 1 0 78 182 179
## 981 0 0 1 1 1 79 134 194
## 982 0 0 0 1 0 59 181 174
## 983 1 0 0 0 1 58 129 165
## 984 0 1 0 1 0 76 114 186
## 985 0 0 1 1 1 70 125 189
## 986 1 0 0 1 0 62 177 145
## 987 1 1 0 0 1 56 127 135
## 988 1 0 0 1 0 41 116 140
## 989 1 1 1 1 0 75 172 160
## 990 1 0 0 1 1 53 104 151
## 991 0 1 1 0 0 48 174 193
## 992 1 0 1 1 1 48 120 150
## 993 1 1 0 1 1 65 100 191
## 994 1 0 0 0 1 51 125 140
## 995 1 0 0 0 0 47 179 164
## 996 1 0 1 1 1 42 110 175
## 997 0 1 0 1 0 75 123 130
## 998 1 1 1 0 1 76 178 179
## 999 0 1 1 1 0 65 139 169
## 1000 0 1 0 0 0 57 182 146
## HDL Risk
## 1 32 11.1
## 2 59 30.1
## 3 59 37.6
## 4 46 13.2
## 5 63 15.1
## 6 22 17.3
## 7 52 2.1
## 8 59 46.0
## 9 99 1.7
## 10 46 48.5
## 11 48 50.3
## 12 44 22.9
## 13 48 1.0
## 14 71 19.7
## 15 95 5.1
## 16 72 17.0
## 17 40 24.2
## 18 51 32.0
## 19 92 53.0
## 20 48 37.3
## 21 46 10.3
## 22 57 7.9
## 23 62 4.9
## 24 23 44.6
## 25 68 23.9
## 26 38 28.8
## 27 24 18.5
## 28 90 13.5
## 29 23 32.6
## 30 96 18.8
## 31 49 9.0
## 32 93 0.4
## 33 88 3.7
## 34 50 24.0
## 35 81 25.1
## 36 93 23.3
## 37 91 56.4
## 38 70 8.1
## 39 86 15.6
## 40 50 9.7
## 41 41 60.5
## 42 90 8.5
## 43 97 8.0
## 44 45 1.4
## 45 58 0.1
## 46 72 23.7
## 47 30 23.7
## 48 42 3.2
## 49 39 0.8
## 50 65 68.5
## 51 32 33.9
## 52 87 35.0
## 53 29 19.6
## 54 21 17.5
## 55 39 51.2
## 56 76 1.6
## 57 52 49.4
## 58 49 1.3
## 59 35 13.4
## 60 93 13.1
## 61 89 1.6
## 62 65 5.2
## 63 61 4.1
## 64 20 60.8
## 65 54 16.9
## 66 63 32.1
## 67 45 10.4
## 68 46 9.4
## 69 98 22.5
## 70 95 7.4
## 71 48 8.2
## 72 82 19.9
## 73 90 55.2
## 74 23 29.1
## 75 39 3.4
## 76 29 24.1
## 77 23 7.0
## 78 100 2.3
## 79 39 32.5
## 80 93 56.1
## 81 33 27.3
## 82 50 44.8
## 83 62 4.9
## 84 79 22.2
## 85 57 10.5
## 86 21 32.0
## 87 44 10.8
## 88 23 6.6
## 89 86 2.5
## 90 34 11.4
## 91 68 0.9
## 92 79 18.3
## 93 27 17.7
## 94 31 29.0
## 95 90 29.5
## 96 94 0.7
## 97 54 11.7
## 98 30 55.3
## 99 63 3.3
## 100 64 52.5
## 101 91 4.2
## 102 28 26.9
## 103 38 1.7
## 104 97 18.0
## 105 28 8.3
## 106 78 43.1
## 107 85 3.1
## 108 51 1.2
## 109 29 15.6
## 110 92 3.1
## 111 33 3.0
## 112 99 37.0
## 113 94 5.2
## 114 37 36.7
## 115 62 0.8
## 116 70 11.6
## 117 92 0.3
## 118 40 6.7
## 119 53 78.1
## 120 49 54.5
## 121 69 10.4
## 122 57 12.0
## 123 30 28.6
## 124 46 6.1
## 125 61 11.9
## 126 30 12.8
## 127 77 4.8
## 128 44 4.3
## 129 28 15.6
## 130 25 25.3
## 131 93 1.9
## 132 71 2.5
## 133 85 27.7
## 134 65 46.5
## 135 25 20.3
## 136 27 35.5
## 137 81 14.1
## 138 46 4.9
## 139 97 3.2
## 140 36 13.7
## 141 39 24.0
## 142 64 4.3
## 143 96 37.1
## 144 78 32.2
## 145 69 21.2
## 146 41 10.3
## 147 64 17.9
## 148 61 43.4
## 149 96 7.1
## 150 81 1.5
## 151 50 32.5
## 152 55 49.7
## 153 55 5.2
## 154 38 7.1
## 155 69 5.5
## 156 23 43.0
## 157 77 23.7
## 158 50 5.9
## 159 65 33.8
## 160 31 5.8
## 161 97 23.4
## 162 38 32.3
## 163 77 30.1
## 164 42 15.9
## 165 33 17.0
## 166 69 5.0
## 167 86 0.6
## 168 23 7.9
## 169 63 3.8
## 170 63 48.3
## 171 29 46.4
## 172 22 11.4
## 173 87 9.2
## 174 99 24.5
## 175 39 23.1
## 176 80 1.3
## 177 91 5.3
## 178 50 35.0
## 179 34 5.1
## 180 63 3.3
## 181 22 10.9
## 182 51 0.4
## 183 20 35.2
## 184 35 5.4
## 185 96 5.4
## 186 64 26.2
## 187 85 10.1
## 188 88 37.2
## 189 29 10.7
## 190 72 23.4
## 191 74 36.2
## 192 65 14.6
## 193 93 7.4
## 194 74 42.9
## 195 83 4.6
## 196 34 24.3
## 197 95 6.7
## 198 93 16.4
## 199 79 54.4
## 200 59 0.3
## 201 45 65.0
## 202 66 20.4
## 203 60 10.8
## 204 96 9.8
## 205 76 51.1
## 206 42 29.6
## 207 56 12.1
## 208 44 42.0
## 209 72 38.5
## 210 75 9.6
## 211 25 33.0
## 212 58 27.3
## 213 79 62.4
## 214 71 4.8
## 215 63 20.8
## 216 21 4.0
## 217 60 29.2
## 218 25 58.2
## 219 20 18.2
## 220 64 24.4
## 221 56 43.3
## 222 96 18.1
## 223 97 0.3
## 224 59 17.1
## 225 44 70.6
## 226 64 15.8
## 227 83 13.0
## 228 45 41.1
## 229 61 28.1
## 230 73 23.7
## 231 35 31.1
## 232 59 18.3
## 233 72 13.3
## 234 83 48.9
## 235 26 9.9
## 236 57 5.3
## 237 66 45.9
## 238 52 17.1
## 239 47 10.6
## 240 62 50.5
## 241 54 4.2
## 242 92 16.1
## 243 33 15.5
## 244 85 0.7
## 245 50 6.6
## 246 83 4.9
## 247 39 4.2
## 248 69 33.8
## 249 100 5.4
## 250 28 1.8
## 251 96 9.1
## 252 67 8.6
## 253 41 4.9
## 254 48 11.2
## 255 94 14.1
## 256 29 18.7
## 257 70 2.6
## 258 56 27.6
## 259 44 7.9
## 260 99 8.6
## 261 80 0.4
## 262 94 16.2
## 263 80 3.2
## 264 84 1.5
## 265 26 13.1
## 266 54 0.6
## 267 73 8.7
## 268 37 15.2
## 269 69 49.6
## 270 80 4.8
## 271 81 47.3
## 272 46 38.0
## 273 40 7.8
## 274 72 6.0
## 275 61 10.1
## 276 69 5.3
## 277 100 33.1
## 278 90 28.4
## 279 84 23.2
## 280 72 3.8
## 281 41 28.0
## 282 46 17.4
## 283 25 23.6
## 284 41 40.9
## 285 29 3.9
## 286 78 7.1
## 287 76 0.9
## 288 64 11.4
## 289 84 11.9
## 290 21 32.0
## 291 94 19.8
## 292 98 5.9
## 293 38 52.1
## 294 21 71.2
## 295 58 20.3
## 296 37 7.8
## 297 95 22.1
## 298 33 17.2
## 299 81 11.0
## 300 40 9.9
## 301 95 5.0
## 302 28 34.1
## 303 85 15.7
## 304 20 9.8
## 305 53 8.7
## 306 32 47.4
## 307 48 25.2
## 308 27 6.3
## 309 63 55.7
## 310 78 11.8
## 311 97 0.6
## 312 88 56.5
## 313 99 27.0
## 314 88 67.5
## 315 48 21.1
## 316 97 17.0
## 317 46 12.8
## 318 99 9.4
## 319 93 7.4
## 320 92 11.0
## 321 48 24.4
## 322 72 62.7
## 323 30 55.6
## 324 100 15.0
## 325 58 31.0
## 326 40 14.3
## 327 95 1.5
## 328 31 45.1
## 329 54 17.1
## 330 31 42.4
## 331 99 10.6
## 332 24 45.4
## 333 76 30.9
## 334 45 2.2
## 335 46 16.4
## 336 86 4.1
## 337 81 1.3
## 338 58 5.2
## 339 84 11.6
## 340 43 4.6
## 341 46 2.3
## 342 96 7.7
## 343 64 8.3
## 344 65 1.1
## 345 65 24.2
## 346 36 23.0
## 347 69 2.3
## 348 84 2.4
## 349 90 12.1
## 350 90 3.1
## 351 91 7.5
## 352 67 36.2
## 353 38 4.5
## 354 44 35.3
## 355 54 4.2
## 356 61 2.3
## 357 80 7.9
## 358 83 0.6
## 359 90 0.8
## 360 63 40.7
## 361 86 18.8
## 362 35 20.8
## 363 38 14.0
## 364 45 20.8
## 365 28 21.7
## 366 50 8.6
## 367 58 1.1
## 368 38 9.5
## 369 76 17.9
## 370 72 68.0
## 371 53 8.1
## 372 22 8.5
## 373 27 36.5
## 374 73 1.6
## 375 50 8.6
## 376 41 27.9
## 377 76 13.3
## 378 33 15.7
## 379 35 27.3
## 380 92 16.4
## 381 93 9.6
## 382 60 25.1
## 383 75 7.0
## 384 83 48.1
## 385 93 32.6
## 386 98 20.6
## 387 84 23.7
## 388 47 26.5
## 389 100 35.5
## 390 61 22.2
## 391 31 24.0
## 392 51 30.3
## 393 50 3.6
## 394 92 1.3
## 395 33 52.0
## 396 36 17.8
## 397 75 4.4
## 398 33 23.9
## 399 69 7.3
## 400 42 41.2
## 401 49 10.7
## 402 34 21.6
## 403 80 12.7
## 404 64 14.4
## 405 38 8.7
## 406 88 1.8
## 407 60 39.8
## 408 30 11.8
## 409 73 56.1
## 410 41 27.8
## 411 26 30.5
## 412 41 4.6
## 413 47 2.4
## 414 91 0.3
## 415 23 25.9
## 416 26 40.6
## 417 48 5.6
## 418 53 25.4
## 419 99 36.1
## 420 85 5.9
## 421 69 4.4
## 422 70 12.7
## 423 91 3.1
## 424 70 4.2
## 425 38 5.8
## 426 65 35.9
## 427 41 26.4
## 428 77 15.5
## 429 44 13.2
## 430 46 6.0
## 431 53 29.0
## 432 66 5.6
## 433 84 38.1
## 434 27 14.6
## 435 85 20.9
## 436 69 12.7
## 437 43 23.3
## 438 26 34.5
## 439 60 8.4
## 440 33 11.7
## 441 33 27.7
## 442 44 5.2
## 443 20 38.8
## 444 28 13.5
## 445 88 50.7
## 446 59 50.6
## 447 36 1.2
## 448 85 56.8
## 449 79 12.7
## 450 33 18.2
## 451 49 5.2
## 452 89 2.2
## 453 97 13.8
## 454 73 20.9
## 455 63 14.1
## 456 25 46.8
## 457 45 2.5
## 458 59 1.7
## 459 31 21.6
## 460 95 16.7
## 461 81 30.1
## 462 76 13.3
## 463 47 16.8
## 464 47 4.8
## 465 89 7.9
## 466 29 9.8
## 467 80 34.3
## 468 80 21.8
## 469 94 30.0
## 470 55 13.8
## 471 100 21.0
## 472 62 11.1
## 473 39 22.0
## 474 49 23.7
## 475 26 72.6
## 476 23 5.7
## 477 40 11.0
## 478 87 2.6
## 479 49 34.5
## 480 38 44.2
## 481 70 0.1
## 482 82 7.8
## 483 45 31.1
## 484 40 9.4
## 485 25 22.7
## 486 68 69.5
## 487 91 46.0
## 488 93 3.2
## 489 78 61.3
## 490 47 15.0
## 491 54 25.1
## 492 41 1.2
## 493 86 1.5
## 494 47 12.1
## 495 82 10.5
## 496 51 55.7
## 497 90 7.7
## 498 97 3.2
## 499 100 16.4
## 500 90 5.4
## 501 93 37.5
## 502 79 1.7
## 503 55 1.5
## 504 79 0.1
## 505 91 0.1
## 506 27 3.8
## 507 65 5.6
## 508 79 0.2
## 509 58 19.7
## 510 93 4.1
## 511 38 69.3
## 512 37 6.2
## 513 64 13.5
## 514 46 41.9
## 515 84 19.4
## 516 95 1.0
## 517 48 42.5
## 518 31 26.5
## 519 76 37.9
## 520 65 63.1
## 521 56 21.4
## 522 53 43.1
## 523 88 0.9
## 524 99 8.4
## 525 38 2.6
## 526 100 51.8
## 527 69 22.2
## 528 45 8.7
## 529 91 1.0
## 530 75 17.6
## 531 60 65.9
## 532 24 3.6
## 533 37 31.3
## 534 28 8.4
## 535 29 20.7
## 536 53 15.4
## 537 32 11.5
## 538 88 71.0
## 539 73 7.4
## 540 100 11.7
## 541 71 5.0
## 542 75 50.5
## 543 23 44.6
## 544 45 30.2
## 545 49 11.5
## 546 85 13.1
## 547 73 8.1
## 548 69 20.6
## 549 77 32.8
## 550 56 3.5
## 551 74 0.8
## 552 84 32.2
## 553 84 39.3
## 554 58 28.7
## 555 59 54.5
## 556 80 44.1
## 557 81 85.4
## 558 41 6.1
## 559 75 41.0
## 560 92 28.1
## 561 24 50.7
## 562 33 12.9
## 563 93 1.8
## 564 61 28.8
## 565 32 15.0
## 566 57 38.8
## 567 24 46.9
## 568 98 30.2
## 569 58 35.5
## 570 63 31.6
## 571 64 56.1
## 572 63 14.0
## 573 39 6.2
## 574 21 10.4
## 575 40 43.1
## 576 85 4.2
## 577 78 6.5
## 578 98 69.4
## 579 77 0.5
## 580 76 13.4
## 581 37 19.1
## 582 87 17.6
## 583 71 32.8
## 584 59 2.3
## 585 52 13.9
## 586 86 3.0
## 587 98 10.6
## 588 86 5.5
## 589 24 27.6
## 590 20 40.8
## 591 28 51.9
## 592 24 29.4
## 593 86 0.4
## 594 31 17.1
## 595 46 1.4
## 596 22 40.5
## 597 95 35.4
## 598 57 5.3
## 599 26 14.2
## 600 60 2.1
## 601 50 1.8
## 602 88 20.2
## 603 66 46.9
## 604 28 21.5
## 605 100 16.4
## 606 71 4.8
## 607 90 22.7
## 608 47 12.8
## 609 44 47.1
## 610 72 9.8
## 611 49 6.4
## 612 48 19.0
## 613 26 20.1
## 614 50 24.3
## 615 78 22.0
## 616 90 21.5
## 617 67 28.0
## 618 39 12.9
## 619 99 10.0
## 620 34 30.7
## 621 40 10.0
## 622 38 22.2
## 623 44 27.5
## 624 97 4.0
## 625 24 15.1
## 626 61 7.8
## 627 42 37.9
## 628 94 6.5
## 629 40 32.0
## 630 97 17.6
## 631 98 50.2
## 632 67 0.7
## 633 59 9.1
## 634 26 10.5
## 635 97 0.6
## 636 60 13.4
## 637 58 37.4
## 638 45 6.8
## 639 96 8.1
## 640 28 23.7
## 641 27 23.8
## 642 87 2.0
## 643 35 13.4
## 644 68 0.7
## 645 79 10.9
## 646 20 8.2
## 647 62 1.7
## 648 57 0.8
## 649 58 8.2
## 650 76 7.8
## 651 55 1.2
## 652 99 12.2
## 653 83 55.3
## 654 36 24.5
## 655 77 0.7
## 656 88 51.9
## 657 52 20.9
## 658 91 25.2
## 659 91 7.5
## 660 38 22.0
## 661 53 11.6
## 662 71 10.9
## 663 24 17.3
## 664 22 28.5
## 665 20 37.3
## 666 71 13.4
## 667 32 5.1
## 668 77 28.1
## 669 63 13.2
## 670 52 29.9
## 671 40 45.0
## 672 40 13.3
## 673 65 40.2
## 674 95 38.1
## 675 36 47.6
## 676 69 12.2
## 677 25 21.2
## 678 74 0.9
## 679 97 15.2
## 680 94 3.8
## 681 60 14.4
## 682 47 6.3
## 683 92 7.5
## 684 60 40.6
## 685 40 7.1
## 686 69 1.5
## 687 97 15.2
## 688 83 34.0
## 689 99 14.0
## 690 64 20.2
## 691 25 21.1
## 692 58 23.6
## 693 53 8.0
## 694 41 13.8
## 695 87 28.8
## 696 65 4.5
## 697 29 11.3
## 698 25 34.6
## 699 83 30.9
## 700 55 26.1
## 701 90 2.2
## 702 66 10.4
## 703 82 32.5
## 704 76 3.7
## 705 74 73.4
## 706 66 1.5
## 707 75 23.7
## 708 62 2.5
## 709 28 25.3
## 710 39 2.8
## 711 30 13.3
## 712 61 8.6
## 713 70 2.9
## 714 84 6.3
## 715 44 2.4
## 716 57 19.1
## 717 31 42.2
## 718 72 0.4
## 719 25 9.8
## 720 99 14.9
## 721 36 3.7
## 722 93 21.4
## 723 86 2.2
## 724 71 6.4
## 725 37 62.7
## 726 58 70.1
## 727 86 28.0
## 728 79 7.7
## 729 68 10.8
## 730 47 33.9
## 731 28 9.2
## 732 60 3.3
## 733 85 16.5
## 734 56 10.2
## 735 86 16.3
## 736 26 7.7
## 737 69 4.4
## 738 38 11.1
## 739 24 14.6
## 740 84 15.4
## 741 25 58.9
## 742 26 5.7
## 743 38 17.1
## 744 38 1.3
## 745 69 16.0
## 746 20 2.9
## 747 60 9.0
## 748 74 15.3
## 749 23 29.0
## 750 49 16.6
## 751 100 16.5
## 752 53 38.7
## 753 45 23.5
## 754 37 10.5
## 755 100 2.2
## 756 28 41.0
## 757 94 37.3
## 758 95 32.1
## 759 99 38.4
## 760 29 25.0
## 761 36 52.7
## 762 67 3.4
## 763 47 22.0
## 764 48 2.8
## 765 35 20.5
## 766 83 49.3
## 767 71 15.9
## 768 29 17.6
## 769 82 15.9
## 770 61 30.3
## 771 57 6.8
## 772 45 39.3
## 773 97 6.7
## 774 29 25.6
## 775 86 4.3
## 776 45 12.3
## 777 22 45.0
## 778 60 2.0
## 779 44 21.3
## 780 82 0.7
## 781 99 8.4
## 782 53 11.0
## 783 34 16.3
## 784 69 10.0
## 785 92 20.6
## 786 77 6.9
## 787 54 18.3
## 788 69 1.5
## 789 66 32.9
## 790 47 26.7
## 791 74 20.7
## 792 91 16.3
## 793 40 29.4
## 794 77 10.5
## 795 23 4.9
## 796 86 0.2
## 797 65 5.1
## 798 50 48.5
## 799 23 23.5
## 800 56 18.3
## 801 21 60.4
## 802 33 10.0
## 803 23 31.2
## 804 28 1.9
## 805 29 3.7
## 806 24 33.2
## 807 96 37.6
## 808 21 16.0
## 809 22 6.2
## 810 37 14.0
## 811 94 14.6
## 812 26 13.3
## 813 97 41.3
## 814 80 7.5
## 815 66 9.7
## 816 56 42.8
## 817 55 34.5
## 818 75 8.7
## 819 75 15.1
## 820 41 4.9
## 821 73 3.3
## 822 48 44.8
## 823 82 38.7
## 824 58 19.8
## 825 40 13.1
## 826 56 29.9
## 827 25 7.5
## 828 84 15.6
## 829 55 4.3
## 830 99 3.0
## 831 70 1.3
## 832 34 3.4
## 833 38 9.5
## 834 96 1.2
## 835 44 1.5
## 836 43 4.1
## 837 30 8.6
## 838 39 4.9
## 839 69 0.5
## 840 60 6.0
## 841 55 2.0
## 842 82 6.9
## 843 86 10.0
## 844 83 60.5
## 845 34 29.0
## 846 46 13.6
## 847 22 55.3
## 848 29 13.0
## 849 70 61.4
## 850 39 21.5
## 851 28 61.1
## 852 23 25.3
## 853 64 7.6
## 854 86 9.3
## 855 86 1.1
## 856 82 9.6
## 857 85 3.1
## 858 21 13.4
## 859 84 1.5
## 860 49 25.9
## 861 54 6.7
## 862 94 10.7
## 863 20 32.8
## 864 66 15.1
## 865 28 28.9
## 866 36 8.1
## 867 97 32.6
## 868 35 55.5
## 869 27 45.1
## 870 20 18.0
## 871 60 27.0
## 872 97 16.8
## 873 66 9.7
## 874 31 16.5
## 875 74 1.0
## 876 40 20.1
## 877 90 51.6
## 878 38 13.2
## 879 82 2.0
## 880 61 2.3
## 881 86 30.4
## 882 24 9.2
## 883 25 4.3
## 884 48 25.2
## 885 32 13.2
## 886 36 29.1
## 887 61 0.4
## 888 72 0.9
## 889 97 14.2
## 890 32 25.7
## 891 79 12.7
## 892 20 28.8
## 893 43 11.1
## 894 43 1.8
## 895 20 13.8
## 896 54 6.3
## 897 30 20.1
## 898 87 15.4
## 899 98 8.4
## 900 47 2.7
## 901 35 22.1
## 902 36 19.2
## 903 81 24.0
## 904 20 26.5
## 905 22 9.0
## 906 44 2.9
## 907 63 60.2
## 908 63 4.8
## 909 22 14.1
## 910 51 76.8
## 911 71 2.4
## 912 48 7.0
## 913 37 53.8
## 914 89 8.8
## 915 71 9.7
## 916 40 5.7
## 917 38 15.0
## 918 46 41.4
## 919 57 11.8
## 920 96 10.9
## 921 66 9.0
## 922 67 26.1
## 923 21 9.9
## 924 50 2.0
## 925 80 0.6
## 926 53 47.4
## 927 23 7.0
## 928 100 7.2
## 929 34 2.9
## 930 35 13.2
## 931 76 6.3
## 932 71 17.3
## 933 69 8.9
## 934 92 13.5
## 935 42 8.4
## 936 82 13.0
## 937 84 41.4
## 938 83 60.5
## 939 50 30.8
## 940 74 27.1
## 941 69 28.9
## 942 28 10.7
## 943 58 5.3
## 944 79 7.1
## 945 27 33.3
## 946 57 2.8
## 947 65 2.0
## 948 80 2.0
## 949 53 44.4
## 950 59 11.2
## 951 25 2.7
## 952 98 78.5
## 953 27 36.0
## 954 95 23.0
## 955 81 17.5
## 956 24 20.3
## 957 98 37.7
## 958 82 10.7
## 959 75 5.5
## 960 20 39.3
## 961 70 4.7
## 962 91 1.3
## 963 20 76.5
## 964 70 7.6
## 965 20 22.7
## 966 59 11.8
## 967 85 43.2
## 968 25 58.1
## 969 53 7.2
## 970 56 20.6
## 971 73 20.7
## 972 58 16.2
## 973 55 65.2
## 974 21 20.2
## 975 48 3.2
## 976 31 11.3
## 977 82 46.3
## 978 98 7.7
## 979 99 40.3
## 980 34 47.3
## 981 67 67.0
## 982 54 9.9
## 983 41 8.3
## 984 68 33.2
## 985 94 30.4
## 986 82 17.9
## 987 22 12.7
## 988 58 0.8
## 989 64 51.5
## 990 69 3.7
## 991 34 21.0
## 992 29 15.4
## 993 84 16.0
## 994 69 1.9
## 995 58 2.8
## 996 67 3.9
## 997 56 23.9
## 998 30 61.3
## 999 66 32.3
## 1000 38 11.7
str(dataset)
## 'data.frame': 1000 obs. of 10 variables:
## $ isMale : int 1 0 0 1 0 0 1 1 0 1 ...
## $ isBlack : int 1 0 1 1 0 0 0 0 0 0 ...
## $ isSmoker : int 0 0 1 1 1 1 1 1 1 0 ...
## $ isDiabetic : int 1 1 1 1 0 0 0 1 0 1 ...
## $ isHypertensive: int 1 1 1 0 1 1 0 0 1 1 ...
## $ Age : int 49 69 50 42 66 52 40 75 42 65 ...
## $ Systolic : int 101 167 181 145 134 154 104 136 169 196 ...
## $ Cholesterol : int 181 155 147 166 199 174 187 189 179 187 ...
## $ HDL : int 32 59 59 46 63 22 52 59 99 46 ...
## $ Risk : num 11.1 30.1 37.6 13.2 15.1 17.3 2.1 46 1.7 48.5 ...
dim(dataset)
## [1] 1000 10
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(dataset)
## dataset
##
## 10 Variables 1000 Observations
## --------------------------------------------------------------------------------
## isMale
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.75 490 0.49 0.5003
##
## --------------------------------------------------------------------------------
## isBlack
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.747 530 0.53 0.4987
##
## --------------------------------------------------------------------------------
## isSmoker
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.749 516 0.516 0.5
##
## --------------------------------------------------------------------------------
## isDiabetic
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.749 522 0.522 0.4995
##
## --------------------------------------------------------------------------------
## isHypertensive
## n missing distinct Info Sum Mean Gmd
## 1000 0 2 0.75 495 0.495 0.5005
##
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 40 0.999 59.11 13.32 42 43
## .25 .50 .75 .90 .95
## 49 59 69 75 77
##
## lowest : 40 41 42 43 44, highest: 75 76 77 78 79
## --------------------------------------------------------------------------------
## Systolic
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 111 1 144.2 36.69 95 102
## .25 .50 .75 .90 .95
## 117 144 171 189 194
##
## lowest : 90 91 92 93 94, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## Cholesterol
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 71 1 164 23.48 133 136
## .25 .50 .75 .90 .95
## 146 164 182 192 196
##
## lowest : 130 131 132 133 134, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## HDL
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 81 1 59.6 27.56 23 27
## .25 .50 .75 .90 .95
## 39 59 81 93 97
##
## lowest : 20 21 22 23 24, highest: 96 97 98 99 100
## --------------------------------------------------------------------------------
## Risk
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 439 1 19.67 18.37 1.20 2.20
## .25 .50 .75 .90 .95
## 6.30 14.40 29.00 45.13 55.30
##
## lowest : 0.1 0.2 0.3 0.4 0.5 , highest: 76.5 76.8 78.1 78.5 85.4
## --------------------------------------------------------------------------------
summary(dataset)
## isMale isBlack isSmoker isDiabetic isHypertensive
## Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.00 Median :1.00 Median :1.000 Median :1.000 Median :0.000
## Mean :0.49 Mean :0.53 Mean :0.516 Mean :0.522 Mean :0.495
## 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.00 Max. :1.00 Max. :1.000 Max. :1.000 Max. :1.000
## Age Systolic Cholesterol HDL Risk
## Min. :40.00 Min. : 90.0 Min. :130 Min. : 20.0 Min. : 0.10
## 1st Qu.:49.00 1st Qu.:117.0 1st Qu.:146 1st Qu.: 39.0 1st Qu.: 6.30
## Median :59.00 Median :144.0 Median :164 Median : 59.0 Median :14.40
## Mean :59.11 Mean :144.2 Mean :164 Mean : 59.6 Mean :19.67
## 3rd Qu.:69.00 3rd Qu.:171.0 3rd Qu.:182 3rd Qu.: 81.0 3rd Qu.:29.00
## Max. :79.00 Max. :200.0 Max. :200 Max. :100.0 Max. :85.40
var(dataset$Age)
## [1] 133.0906
var(dataset$Systolic)
## [1] 1009.621
var(dataset$Cholesterol)
## [1] 413.3045
var(dataset$HDL)
## [1] 569.4669
var(dataset$Risk)
## [1] 290.4959
All the attributes’ variance results are higher than their mean values, which implies that the dataset has greater variability and is more heterogeneous. This might indicate that the values in our dataset are more scattered; have a wider range of values, potentially suggesting a more diverse or varied pattern in the data.
library(ggplot2)
ggplot(dataset, aes(x = Age, y =Systolic, color= 'red'))+
geom_point() +
xlab("Age") +
ylab("Blood Pressure")
In order to gain a deeper understanding of our dataset, we examined the attributes “Systolic” and “Age” to determine if there was a predictive or correlational relationship between them. However, after analyzing the scatter plot, we discovered that there is no discernible relationship or correlation between these two attributes.
ggplot(dataset, aes(x = Systolic, y = Risk)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, aes(color = "Regression Line")) +
facet_wrap(~cut(Age, 3), scales = "free") +
xlab("Systolic Blood Pressure") +
ylab("Risk") +
ggtitle("Relationship between Systolic Blood Pressure and Risk at Different Age Levels") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
However, notable association between ‘Systolic Blood Pressure’, ‘Age’, and ‘Risk’, segmented into various age categories. It shows that risk notably rises with age and Blood Pressure,as the regression line for the age bracket (66,79] exhibits higher risks, indicating a high correlation between advancing age and elevated risk levels in this dataset.
library(tidyr)
dataset_long <- gather(dataset, key = "column", value = "value", Age:ncol(dataset))
ggplot(dataset_long, aes(x = value, fill = column)) +
geom_density(alpha = 0.7) +
facet_wrap(~column, scales = "free") +
xlab("Value") +
ylab("Density")
To understand the relative frequency of different values within our dataest we measeured the density, and analyzed the corresponding graphs. Here are the observations we made:
- The graph representing the distribution of ages shows a reasonable representation of ages between 40 and 80 within the dataset. This suggests that the age values are well-distributed within this range.
- Both the density graphs for cholesterol and HDL indicate a slight skew towards lower cholesterol levels. This suggests that the majority of the data points tend to have lower cholesterol values rather than higher ones.
- The density graph for systolic blood pressure displays a uniform distribution across the entire range of blood pressures. This indicates that the data points are evenly spread out without any significant concentration in specific pressure ranges.
- The density graph for the risk variable exhibits a positively skewed (right-skewed) distribution. This implies that there is a higher frequency of data points with lower risk values, while the occurrence of higher risk values is relatively less frequent.
bb <- dataset$isSmoker %>% table() %>%
barplot(bb , col = c("lightgreen","darkred"), width= c(4,4.1),space=0.1, names.arg=c("o","1"), legend.text = c("Non-Smoker","Smoker"))
To better understand the smoking status within our dataset, we visualized the data using a bar plot. This visualization was chosen to provide a clear and easily interpretable representation of the differences in smoking status. From the bar plot, we observed that the numbers are nearly evenly distributed between non-smokers (0) and smokers (1). This indicates that there is a balanced representation of individuals who are non-smokers and smokers in the dataset.
library(corrplot)
## corrplot 0.92 loaded
corr_matrix <- cor(dataset)
corrplot(corr_matrix, method = "color", type = "lower", tl.col = "black", tl.srt = 45,
addCoef.col = "black", number.cex = 0.7, tl.cex = 0.7, col = colorRampPalette(c("white", "lightblue"))(90))
## Warning in ind1:ind2: numerical expression has 2 elements: only the first used
By analyzing the correlation matrix of our dataset, we can identify suspicious events and patterns in the data. However, it is evident that there are no strong correlations among the features in the dataset. Despite this, we can rank the correlations in descending order based on their impact on the risk of heart disease.The order of correlations, from highest to lowest in terms of their influence on heart disease risk, is as follows: Age, Systolic blood pressure, is Diabetic, is Smoker, is Hypertensive, gender is male, , race is black, Cholestrol, HDL.
boxplot(dataset$Age)
The Age boxplot shows a wide range of values that might lead to a lower accuracy of the results when it comes to clculations so we need change it to a standardized range. Additionally, the boxplot analysis indicates that there are no outliers present in the Age attribute. This implies that the Age data points are within a reasonable range and do not deviate significantly from the overall distribution of values.
boxplot(dataset$Systolic)
The boxplot analysis of the Systolic blood pressure attribute reveals the absence of outliers, indicating that the data points are within a reasonable range without any extreme values. However, it is worth noting that the range of Systolic blood pressure is considerably large. To ensure accurate calculations and mitigate potential conflicts, it is recommended to transform the Systolic blood pressure into a smaller and standardized range. This transformation will help normalize the data and make it more suitable for analysis and calculations.
boxplot(dataset$Cholesterol)
According to the boxplot analysis of the Cholesterol attribute, no outliers are observed, suggesting that the data points are within a reasonable range without any extreme values. However, it is important to narrow down the range of values to optimize the accuracy of our calculations. By reducing the range of Cholesterol values, we can improve the reliability and precision of our dataset, enabling us to obtain more reliable and meaningful results.
boxplot(dataset$HDL)
The HDL boxplot reveal that there are no outlires shown. However, it is necessary to transform the range of HDL values to bring them into a standardized and common range. By performing this transformation, we can almost ensure to have better insights and improved data quality.
Since missing/null values can affect the dataset badly we decided to check it and delete all missing/null values from our dataset to make it as clean as possible so that we can end up with efficint dataset resulting to a higher possibiliaty of accurete results later on.
# Check for missing values
missing_values <- colSums(is.na(dataset))
# Print columns with missing values
print("Columns with missing values:")
## [1] "Columns with missing values:"
print(names(missing_values)[missing_values > 0])
## character(0)
# Print the count of missing values for each column
print("Count of missing values for each column:")
## [1] "Count of missing values for each column:"
print(missing_values)
## isMale isBlack isSmoker isDiabetic isHypertensive
## 0 0 0 0 0
## Age Systolic Cholesterol HDL Risk
## 0 0 0 0 0
In data analysis, checking and removing outliers is crucial to ensure the reliability of statistical insights. Outliers, as extreme data points, can distort summary statistics, potentially leading to inaccurate analyses. By identifying and, if necessary, removing outliers, we enhance the robustness of our findings.
# Compute IRQ
Q1 <- quantile(dataset$Age, 0.25)
Q3 <- quantile(dataset$Age, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Age < lower_bound | dataset$Age > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Age outliers:", num_outliers))
## [1] "Number of Age outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$Systolic, 0.25)
Q3 <- quantile(dataset$Systolic, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Systolic < lower_bound | dataset$Systolic > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Systolic outliers:", num_outliers))
## [1] "Number of Systolic outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$Cholesterol, 0.25)
Q3 <- quantile(dataset$Cholesterol, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Cholesterol < lower_bound | dataset$Cholesterol > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Cholesterol outliers:", num_outliers))
## [1] "Number of Cholesterol outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$HDL, 0.25)
Q3 <- quantile(dataset$HDL, 0.75)
IQR <- Q3 - Q1
# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$HDL < lower_bound | dataset$HDL > upper_bound)
# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of HDL outliers:", num_outliers))
## [1] "Number of HDL outliers: 0"
The result indicates that there are no outliers, but we will also use a box plot to ensure that there are no outliers.
boxplot(dataset[,c(6,7,8,9)], main="Boxplot with Outliers", col=c("lightblue","lightblue","lightblue","lightblue"))
By using the box plot we can see that there are no outliers in the data set.
In analyzing the dataset,The initial dataset provided a comprehensive and relevant set of information for the research objectives without the need for removal or condensation of variables.
used the findCorrelation function in caret library to outputs the index of variables we need to delete. targeting any pair with a correlation coefficient exceeding 0.75.
findCorrelation(cor(dataset), cutoff=0.75)
## integer(0)
In our case, the function finds out that no feature need to be deleted.
Data normalization is a preprocessing step that involves transforming numerical data within a dataset to a standard, uniform scale. This process ensures that all variables, regardless of their original units or scales, are brought into a consistent and comparable range. the following attributes were selected for normalization:(age, systolic, cholestrol, HDL)
normalize <- function(x)
{
return ((x - min(x))/ (max(x)- min(x)) )
}
dataset$Age<-normalize(dataset$Age)
dataset$Systolic<-normalize(dataset$Systolic)
dataset$Cholesterol<-normalize(dataset$Cholesterol)
dataset$HDL<-normalize(dataset$HDL)
head(dataset)
we have successfully completed the data normalization. This process entailed scaling our numerical features to a standardized range, typically between 0 and 1.
To make our dataset understandable and easily interpretable, especially when using tree-based classification methods, we transformed the continuous class label ‘Risk’ into specific, categorized risk levels.
These levels are delineated as:
Low risk (<5%), Borderline risk (5% to 7.4%), Intermediate risk (7.5% to 19.9%), and High risk (≥20%).
# Categorize 'Risk' into defined categories
dataset$Risk <- cut(
dataset$Risk,
breaks = c(-Inf, 5, 7.4, 19.9, Inf),
labels = c("Low risk", "Borderline risk", "Intermediate risk", "High risk"),
right = FALSE,
include.lowest = TRUE
)
our dataset after Discretization:
head(dataset)
Feature selection is one of the most important task to boost performance of our machine learning model by removing irrelevant features the model will make decisions only using important features. we will use Recursive Feature Elimination (RFE), which is a widely used wrapper-type algorithm for selecting features that are most relevant in predicting the target variable ‘Risk’ in our case.
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-2
# ensure results are repeatable
set.seed(7)
# Define RFE control parameters
ctrl <- rfeControl(functions=rfFuncs, method="cv", number=10)
# Execute RFE using dataset features 1-9 and "Risk" as the class lable
results <- rfe(dataset[,1:9], dataset$Risk, sizes=c(1:9), rfeControl=ctrl)
# Display RFE results
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.5832 0.3887 0.03859 0.05413
## 2 0.5489 0.3401 0.03516 0.05332
## 3 0.6230 0.4335 0.03123 0.04525
## 4 0.6671 0.5073 0.04478 0.06397
## 5 0.6770 0.5222 0.02512 0.03598
## 6 0.7132 0.5739 0.03336 0.05041
## 7 0.7821 0.6764 0.03986 0.05887
## 8 0.7812 0.6748 0.03076 0.04539
## 9 0.8009 0.7051 0.02630 0.03865 *
##
## The top 5 variables (out of 9):
## Age, Systolic, isDiabetic, isSmoker, isMale
plot(results, type=c("g", "o"))
The asterisk (*) in the column indicates the number of features recommended by RFE as yielding the best model according to the resampling results. it shows that when 9 variables are used, the model achieves the best accuracy of approximately 80% and a kappa value of 0.7.
The graphical representation of feature importance :
The “Mean Decrease Gini” score tells us how crucial a feature is for making accurate predictions in a Random Forest model. A higher score means the feature is more valuable in deciding how to classify the data correctly, helping the model make better decisions.
# Setting seed for reproducibility
set.seed(123)
# Fit a random forest model
rf_model <- randomForest(Risk ~ ., data = dataset)
var_imp <- importance(rf_model)
var_imp_df <- data.frame(variables = row.names(var_imp), var_imp)
# Sorting variables based on importance
var_imp_df <- var_imp_df[order(var_imp_df$MeanDecreaseGini, decreasing = TRUE),]
# Plotting variable importance using ggplot2
ggplot(var_imp_df, aes(x = reorder(variables, MeanDecreaseGini), y = MeanDecreaseGini)) +geom_col() +
coord_flip() +
labs(title = "Feature Importance",
x = "Features",
y = "Importance (Mean Decrease in Gini)")
The graph shows that ‘Age’ and ‘Systolic’ are key variables influencing our model’s predictions of ‘Risk’. while variables like isHypertensive, isBlack were found to have the least impact on the model’s predictive capability.
Overall, we think it’s a good practice to make use of all our features as recommended by RFE, particularly when we are dealing with a modest number, to avoid potential overfitting.we
Balancing data is crucial for improving the performance and fairness of machine learning models. When data are imbalanced, with one class significantly outnumbering the others, models tend to bias towards the majority class, leading to poor predictive accuracy for minority classes.
# Calculate class distribution
class_distribution <- table(dataset$Risk)
# Create a bar plot
barplot(class_distribution,
main = "Class Distribution for Risk",
xlab = "Risk Level",
ylab = "Count",
names.arg = levels(dataset$Risk))
library(ROSE)
## Loaded ROSE 0.0-4
balanced_data <- upSample(dataset[, 1:9], dataset$Risk, yname = "Risk")
# Plot the distribution of the "Risk" classes
plot(balanced_data$Risk)
# Check the proportion and count of "Risk" classes
prop_table <- prop.table(table(balanced_data$Risk))
count_table <- table(balanced_data$Risk)
After balancing our data, the model becomes more capable of providing accurate predictions, ensuring a fair evaluation of their performance.
Classification analysis is a fundamental aspect of machine learning, focusing on categorizing data into distinct classes. In our study, we aim to build predictive models that efficiently assign predefined labels to new instances based on their features. To enhance the robustness of our models, we have divided the dataset into three sets: training, validation, and testing. By employing different proportions of training data—60%, 70%, and 80%—we seek to evaluate and compare the models’ performances. This approach ensures a comprehensive understanding of model behavior under varying training scenarios, guiding us to select the most effective model for our specific dataset.
Gain ratio is a metric that assesses the quality of a split within decision tree algorithms. to evaluate the quality of a split based on the information gain and the intrinsic information of a feature.we have implemented the Gain Ratio (C4.5) algorithm and the J48 function from the RWeka package. This algorithm partitions our data into training and testing sets, builds a J48 decision tree on the training data,
1-partition the data into ( 60% training, 40% testing):
# Load the RWeka package
library(RWeka)
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# Define the formula
myFormula <- Risk ~ .
# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData)
# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##
## Low risk Borderline risk Intermediate risk High risk
## Low risk 240 1 3 1
## Borderline risk 6 217 5 1
## Intermediate risk 0 0 225 13
## High risk 0 3 17 227
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | HDL <= 0.225
## | | Systolic <= 0.545455
## | | | isHypertensive <= 0
## | | | | Age <= 0.025641: Low risk (6.0)
## | | | | Age > 0.025641
## | | | | | HDL <= 0.0125: Intermediate risk (6.0)
## | | | | | HDL > 0.0125
## | | | | | | Cholesterol <= 0.2
## | | | | | | | Systolic <= 0.290909: Low risk (3.0)
## | | | | | | | Systolic > 0.290909: Intermediate risk (3.0)
## | | | | | | Cholesterol > 0.2
## | | | | | | | Age <= 0.128205
## | | | | | | | | isBlack <= 0: Borderline risk (2.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | Age <= 0.051282: Intermediate risk (2.0)
## | | | | | | | | | Age > 0.051282: Low risk (2.0)
## | | | | | | | Age > 0.128205
## | | | | | | | | Age <= 0.435897: Borderline risk (18.0)
## | | | | | | | | Age > 0.435897
## | | | | | | | | | isDiabetic <= 0
## | | | | | | | | | | Age <= 0.461538: Borderline risk (3.0)
## | | | | | | | | | | Age > 0.461538
## | | | | | | | | | | | Systolic <= 0.190909: Intermediate risk (2.0)
## | | | | | | | | | | | Systolic > 0.190909: Borderline risk (3.0)
## | | | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | isHypertensive > 0
## | | | | isDiabetic <= 0
## | | | | | Systolic <= 0.309091
## | | | | | | Systolic <= 0.190909
## | | | | | | | isMale <= 0: Low risk (6.0)
## | | | | | | | isMale > 0: Intermediate risk (3.0)
## | | | | | | Systolic > 0.190909: Borderline risk (5.0/1.0)
## | | | | | Systolic > 0.309091: Intermediate risk (10.0)
## | | | | isDiabetic > 0
## | | | | | isMale <= 0: Intermediate risk (11.0/1.0)
## | | | | | isMale > 0
## | | | | | | isBlack <= 0: High risk (3.0/1.0)
## | | | | | | isBlack > 0: Intermediate risk (4.0/1.0)
## | | Systolic > 0.545455
## | | | Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## | | | Cholesterol > 0.014286
## | | | | isSmoker <= 0
## | | | | | Systolic <= 0.681818: High risk (3.0)
## | | | | | Systolic > 0.681818
## | | | | | | Age <= 0.461538: Intermediate risk (9.0)
## | | | | | | Age > 0.461538: High risk (3.0/1.0)
## | | | | isSmoker > 0: High risk (29.0/6.0)
## | HDL > 0.225
## | | Age <= 0.282051
## | | | isBlack <= 0
## | | | | Cholesterol <= 0.557143
## | | | | | Systolic <= 0.718182: Low risk (78.0)
## | | | | | Systolic > 0.718182
## | | | | | | isDiabetic <= 0: Low risk (18.0/1.0)
## | | | | | | isDiabetic > 0
## | | | | | | | HDL <= 0.6375
## | | | | | | | | Systolic <= 0.909091: Low risk (5.0)
## | | | | | | | | Systolic > 0.909091: Intermediate risk (2.0)
## | | | | | | | HDL > 0.6375: Borderline risk (11.0)
## | | | | Cholesterol > 0.557143
## | | | | | Systolic <= 0.163636: Low risk (5.0)
## | | | | | Systolic > 0.163636
## | | | | | | isSmoker <= 0
## | | | | | | | Age <= 0.230769: Low risk (15.0)
## | | | | | | | Age > 0.230769: Borderline risk (8.0/1.0)
## | | | | | | isSmoker > 0
## | | | | | | | HDL <= 0.7375: Borderline risk (33.0/4.0)
## | | | | | | | HDL > 0.7375: Low risk (8.0/1.0)
## | | | isBlack > 0
## | | | | Systolic <= 0.536364
## | | | | | isMale <= 0
## | | | | | | Cholesterol <= 0.828571: Low risk (30.0/1.0)
## | | | | | | Cholesterol > 0.828571: Borderline risk (2.0)
## | | | | | isMale > 0
## | | | | | | isDiabetic <= 0
## | | | | | | | isSmoker <= 0
## | | | | | | | | isHypertensive <= 0: Low risk (9.0)
## | | | | | | | | isHypertensive > 0: Borderline risk (6.0/1.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Age <= 0.179487: Borderline risk (12.0/1.0)
## | | | | | | | | Age > 0.179487: Intermediate risk (2.0)
## | | | | | | isDiabetic > 0
## | | | | | | | Systolic <= 0.072727: Low risk (4.0)
## | | | | | | | Systolic > 0.072727: Intermediate risk (9.0/1.0)
## | | | | Systolic > 0.536364
## | | | | | isHypertensive <= 0
## | | | | | | Age <= 0.205128
## | | | | | | | isMale <= 0
## | | | | | | | | Age <= 0.128205: Borderline risk (5.0)
## | | | | | | | | Age > 0.128205: Low risk (6.0/1.0)
## | | | | | | | isMale > 0
## | | | | | | | | Cholesterol <= 0.685714: Intermediate risk (5.0/1.0)
## | | | | | | | | Cholesterol > 0.685714: Borderline risk (8.0)
## | | | | | | Age > 0.205128
## | | | | | | | isSmoker <= 0
## | | | | | | | | Age <= 0.25641: Intermediate risk (2.0)
## | | | | | | | | Age > 0.25641: Low risk (2.0)
## | | | | | | | isSmoker > 0: Intermediate risk (4.0)
## | | | | | isHypertensive > 0
## | | | | | | Systolic <= 0.890909
## | | | | | | | Age <= 0.076923
## | | | | | | | | HDL <= 0.5625: Intermediate risk (3.0)
## | | | | | | | | HDL > 0.5625: Borderline risk (7.0)
## | | | | | | | Age > 0.076923
## | | | | | | | | Age <= 0.179487: Intermediate risk (7.0)
## | | | | | | | | Age > 0.179487
## | | | | | | | | | isDiabetic <= 0: Intermediate risk (4.0/1.0)
## | | | | | | | | | isDiabetic > 0: High risk (2.0)
## | | | | | | Systolic > 0.890909: High risk (7.0)
## | | Age > 0.282051
## | | | Systolic <= 0.7
## | | | | isDiabetic <= 0
## | | | | | isMale <= 0
## | | | | | | Age <= 0.487179
## | | | | | | | Systolic <= 0.381818: Low risk (19.0)
## | | | | | | | Systolic > 0.381818
## | | | | | | | | HDL <= 0.55: Borderline risk (3.0)
## | | | | | | | | HDL > 0.55: Low risk (10.0/1.0)
## | | | | | | Age > 0.487179
## | | | | | | | Cholesterol <= 0.3: Low risk (3.0)
## | | | | | | | Cholesterol > 0.3
## | | | | | | | | Systolic <= 0.363636: Borderline risk (13.0)
## | | | | | | | | Systolic > 0.363636
## | | | | | | | | | Cholesterol <= 0.414286: Borderline risk (2.0)
## | | | | | | | | | Cholesterol > 0.414286: Intermediate risk (2.0)
## | | | | | isMale > 0
## | | | | | | Systolic <= 0.663636
## | | | | | | | isHypertensive <= 0
## | | | | | | | | isSmoker <= 0: Low risk (5.0)
## | | | | | | | | isSmoker > 0: Intermediate risk (2.0)
## | | | | | | | isHypertensive > 0: Intermediate risk (18.0)
## | | | | | | Systolic > 0.663636: Borderline risk (7.0)
## | | | | isDiabetic > 0
## | | | | | Age <= 0.461538
## | | | | | | isSmoker <= 0
## | | | | | | | isMale <= 0
## | | | | | | | | isHypertensive <= 0
## | | | | | | | | | HDL <= 0.675: Borderline risk (8.0/1.0)
## | | | | | | | | | HDL > 0.675: Low risk (4.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Systolic <= 0.290909: Low risk (2.0)
## | | | | | | | | | Systolic > 0.290909: Intermediate risk (2.0)
## | | | | | | | isMale > 0
## | | | | | | | | Systolic <= 0.072727: Borderline risk (12.0)
## | | | | | | | | Systolic > 0.072727
## | | | | | | | | | isHypertensive <= 0: Intermediate risk (5.0)
## | | | | | | | | | isHypertensive > 0: Borderline risk (4.0)
## | | | | | | isSmoker > 0
## | | | | | | | isHypertensive <= 0
## | | | | | | | | Systolic <= 0.6
## | | | | | | | | | Cholesterol <= 0.628571: Borderline risk (13.0/1.0)
## | | | | | | | | | Cholesterol > 0.628571: Intermediate risk (2.0)
## | | | | | | | | Systolic > 0.6: High risk (2.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | isMale <= 0: Intermediate risk (3.0/1.0)
## | | | | | | | | isMale > 0: High risk (2.0)
## | | | | | Age > 0.461538
## | | | | | | Cholesterol <= 0.328571: Borderline risk (2.0)
## | | | | | | Cholesterol > 0.328571: Intermediate risk (19.0/1.0)
## | | | Systolic > 0.7
## | | | | Systolic <= 0.9
## | | | | | isSmoker <= 0: Intermediate risk (12.0)
## | | | | | isSmoker > 0
## | | | | | | Age <= 0.384615: Intermediate risk (6.0)
## | | | | | | Age > 0.384615: High risk (7.0/1.0)
## | | | | Systolic > 0.9
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.936364: Borderline risk (7.0)
## | | | | | | Systolic > 0.936364: Intermediate risk (5.0/1.0)
## | | | | | isDiabetic > 0: High risk (4.0)
## Age > 0.564103
## | Systolic <= 0.5
## | | isDiabetic <= 0
## | | | HDL <= 0.15
## | | | | Systolic <= 0.190909
## | | | | | isMale <= 0: Low risk (2.0)
## | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | Systolic > 0.190909: High risk (9.0)
## | | | HDL > 0.15
## | | | | Systolic <= 0.427273
## | | | | | Cholesterol <= 0.7
## | | | | | | Systolic <= 0.290909
## | | | | | | | isHypertensive <= 0
## | | | | | | | | HDL <= 0.6: Intermediate risk (7.0)
## | | | | | | | | HDL > 0.6
## | | | | | | | | | Cholesterol <= 0.371429: Intermediate risk (3.0)
## | | | | | | | | | Cholesterol > 0.371429
## | | | | | | | | | | Systolic <= 0.054545: Intermediate risk (2.0)
## | | | | | | | | | | Systolic > 0.054545: Borderline risk (18.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | Systolic <= 0.172727
## | | | | | | | | | Age <= 0.692308: Low risk (3.0)
## | | | | | | | | | Age > 0.692308: Intermediate risk (3.0)
## | | | | | | | | Systolic > 0.172727: Borderline risk (7.0/1.0)
## | | | | | | Systolic > 0.290909: Intermediate risk (15.0/1.0)
## | | | | | Cholesterol > 0.7
## | | | | | | Age <= 0.897436: Intermediate risk (12.0)
## | | | | | | Age > 0.897436
## | | | | | | | Systolic <= 0.209091: Intermediate risk (3.0/1.0)
## | | | | | | | Systolic > 0.209091: High risk (3.0)
## | | | | Systolic > 0.427273
## | | | | | Systolic <= 0.472727: High risk (5.0)
## | | | | | Systolic > 0.472727: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Age <= 0.923077
## | | | | | Systolic <= 0.318182: Intermediate risk (21.0/3.0)
## | | | | | Systolic > 0.318182: High risk (8.0/1.0)
## | | | | Age > 0.923077: High risk (5.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.794872: Intermediate risk (4.0)
## | | | | | | Age > 0.794872: High risk (2.0)
## | | | | | isBlack > 0: High risk (3.0)
## | | | | isHypertensive > 0: High risk (22.0)
## | Systolic > 0.5: High risk (128.0/10.0)
##
## Number of Leaves : 110
##
## Size of the tree : 219
# Plot the J48 decision tree
plot(C45Fit)
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 126 4 5 1
## Borderline risk 16 165 24 12
## Intermediate risk 6 0 87 27
## High risk 3 7 31 115
# Calculate performance metrics
accuracy_G1 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G1 <-( 1 - accuracy_G1)
sensitivity_G1 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G1 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G1 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_G1, "\n")
## Accuracy: 0.7837838
cat("Error Rate: ", error_rate_G1, "\n")
## Error Rate: 0.2162162
cat("Sensitivity (Recall): ", sensitivity_G1, "\n")
## Sensitivity (Recall): 0.7371795
cat("Specificity: ", specificity_G1, "\n")
## Specificity: 0.7991543
cat("Precision: ", precision_G1, "\n")
## Precision: 0.7419355
Analysis:
- The C4.5 decision tree, employing the gain ratio criterion, showcases robust performance on our dataset with an accuracy of 78.38%. Its ability to effectively capture complex relationships is reflected in the tree’s structure, consisting of 219 nodes and 110 leaves. Notably, the model demonstrates a balanced trade-off between sensitivity (73.72%) and specificity (79.92%), indicating its proficiency in correctly identifying positive and negative instances. With a precision of 74.19%, the model reliably makes accurate positive predictions.
2-partition the data into ( 70% training, 30% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# Define the formula
myFormula <- Risk ~ .
# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData )
# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##
## Low risk Borderline risk Intermediate risk High risk
## Low risk 272 1 5 1
## Borderline risk 4 270 6 4
## Intermediate risk 5 0 265 12
## High risk 0 0 14 273
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | HDL <= 0.225
## | | Systolic <= 0.545455
## | | | isDiabetic <= 0
## | | | | isSmoker <= 0
## | | | | | Age <= 0.25641: Low risk (13.0)
## | | | | | Age > 0.25641
## | | | | | | isHypertensive <= 0
## | | | | | | | Cholesterol <= 0.257143: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.081818: Low risk (3.0)
## | | | | | | | Systolic > 0.081818
## | | | | | | | | Age <= 0.538462: Intermediate risk (5.0)
## | | | | | | | | Age > 0.538462: Borderline risk (2.0)
## | | | | isSmoker > 0
## | | | | | isBlack <= 0
## | | | | | | Systolic <= 0.309091: Borderline risk (9.0/1.0)
## | | | | | | Systolic > 0.309091: Intermediate risk (2.0)
## | | | | | isBlack > 0: Intermediate risk (13.0/1.0)
## | | | isDiabetic > 0
## | | | | Age <= 0.410256
## | | | | | Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## | | | | | Cholesterol > 0.357143
## | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Cholesterol <= 0.685714
## | | | | | | | | isSmoker <= 0: Borderline risk (3.0)
## | | | | | | | | isSmoker > 0: High risk (2.0)
## | | | | | | | Cholesterol > 0.685714: Intermediate risk (4.0)
## | | | | Age > 0.410256
## | | | | | Cholesterol <= 0.271429: Low risk (3.0/1.0)
## | | | | | Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## | | Systolic > 0.545455
## | | | Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## | | | Cholesterol > 0.014286
## | | | | isSmoker <= 0
## | | | | | isDiabetic <= 0: Intermediate risk (12.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isMale <= 0: Intermediate risk (4.0/1.0)
## | | | | | | isMale > 0: High risk (5.0)
## | | | | isSmoker > 0
## | | | | | isDiabetic <= 0
## | | | | | | Cholesterol <= 0.242857: Intermediate risk (2.0)
## | | | | | | Cholesterol > 0.242857: High risk (13.0/2.0)
## | | | | | isDiabetic > 0
## | | | | | | HDL <= 0.2: High risk (17.0)
## | | | | | | HDL > 0.2: Intermediate risk (3.0/1.0)
## | HDL > 0.225
## | | Age <= 0.282051
## | | | Systolic <= 0.163636
## | | | | isBlack <= 0: Low risk (44.0)
## | | | | isBlack > 0
## | | | | | isMale <= 0: Low risk (9.0)
## | | | | | isMale > 0
## | | | | | | Systolic <= 0.090909: Low risk (6.0)
## | | | | | | Systolic > 0.090909: Intermediate risk (4.0)
## | | | Systolic > 0.163636
## | | | | isBlack <= 0
## | | | | | Cholesterol <= 0.242857: Low risk (38.0/1.0)
## | | | | | Cholesterol > 0.242857
## | | | | | | HDL <= 0.8125
## | | | | | | | isSmoker <= 0
## | | | | | | | | Age <= 0.230769: Low risk (31.0)
## | | | | | | | | Age > 0.230769
## | | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | | isMale > 0: Borderline risk (12.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Systolic <= 0.309091
## | | | | | | | | | Systolic <= 0.218182: Borderline risk (3.0)
## | | | | | | | | | Systolic > 0.218182: Low risk (9.0)
## | | | | | | | | Systolic > 0.309091
## | | | | | | | | | isMale <= 0
## | | | | | | | | | | isHypertensive <= 0: Borderline risk (17.0/1.0)
## | | | | | | | | | | isHypertensive > 0: Low risk (4.0)
## | | | | | | | | | isMale > 0
## | | | | | | | | | | Systolic <= 0.9
## | | | | | | | | | | | HDL <= 0.4625
## | | | | | | | | | | | | isDiabetic <= 0: Borderline risk (8.0/1.0)
## | | | | | | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | | | | | | | | | HDL > 0.4625: Borderline risk (23.0)
## | | | | | | | | | | Systolic > 0.9: Intermediate risk (2.0)
## | | | | | | HDL > 0.8125
## | | | | | | | isMale <= 0: Low risk (17.0)
## | | | | | | | isMale > 0
## | | | | | | | | Age <= 0.076923: Low risk (3.0)
## | | | | | | | | Age > 0.076923: Intermediate risk (2.0)
## | | | | isBlack > 0
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.554545
## | | | | | | | Systolic <= 0.245455
## | | | | | | | | isMale <= 0: Low risk (2.0)
## | | | | | | | | isMale > 0: Borderline risk (15.0/1.0)
## | | | | | | | Systolic > 0.245455
## | | | | | | | | isHypertensive <= 0: Low risk (20.0/2.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | HDL <= 0.4625: Borderline risk (5.0/1.0)
## | | | | | | | | | HDL > 0.4625: Low risk (5.0)
## | | | | | | Systolic > 0.554545
## | | | | | | | isSmoker <= 0
## | | | | | | | | isMale <= 0
## | | | | | | | | | Age <= 0.153846: Borderline risk (5.0/1.0)
## | | | | | | | | | Age > 0.153846: Low risk (6.0)
## | | | | | | | | isMale > 0
## | | | | | | | | | Cholesterol <= 0.7
## | | | | | | | | | | Systolic <= 0.718182: Borderline risk (3.0)
## | | | | | | | | | | Systolic > 0.718182: Intermediate risk (2.0)
## | | | | | | | | | Cholesterol > 0.7: Borderline risk (10.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | Age <= 0: Borderline risk (5.0/1.0)
## | | | | | | | | Age > 0
## | | | | | | | | | Cholesterol <= 0.071429: Low risk (2.0)
## | | | | | | | | | Cholesterol > 0.071429
## | | | | | | | | | | Cholesterol <= 0.871429: Intermediate risk (12.0)
## | | | | | | | | | | Cholesterol > 0.871429: High risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | Systolic <= 0.309091
## | | | | | | | HDL <= 0.4375
## | | | | | | | | Age <= 0.205128: Intermediate risk (2.0)
## | | | | | | | | Age > 0.205128: Borderline risk (3.0)
## | | | | | | | HDL > 0.4375: Low risk (6.0)
## | | | | | | Systolic > 0.309091
## | | | | | | | isHypertensive <= 0
## | | | | | | | | Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## | | | | | | | | Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## | | | | | | | isHypertensive > 0
## | | | | | | | | Age <= 0.153846
## | | | | | | | | | Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## | | | | | | | | | Systolic > 0.881818: High risk (2.0)
## | | | | | | | | Age > 0.153846: High risk (6.0)
## | | Age > 0.282051
## | | | Systolic <= 0.254545
## | | | | isDiabetic <= 0
## | | | | | isHypertensive <= 0: Low risk (20.0)
## | | | | | isHypertensive > 0
## | | | | | | isMale <= 0
## | | | | | | | Cholesterol <= 0.385714: Low risk (5.0)
## | | | | | | | Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## | | | | | | isMale > 0: Intermediate risk (6.0/1.0)
## | | | | isDiabetic > 0
## | | | | | Age <= 0.435897
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.2: Borderline risk (21.0/1.0)
## | | | | | | | Systolic > 0.2: Low risk (3.0/1.0)
## | | | | | | isHypertensive > 0: Low risk (3.0)
## | | | | | Age > 0.435897: Intermediate risk (10.0)
## | | | Systolic > 0.254545
## | | | | isMale <= 0
## | | | | | isDiabetic <= 0
## | | | | | | Age <= 0.384615
## | | | | | | | HDL <= 0.5125: Intermediate risk (2.0)
## | | | | | | | HDL > 0.5125: Low risk (12.0)
## | | | | | | Age > 0.384615
## | | | | | | | Cholesterol <= 0.814286
## | | | | | | | | Systolic <= 0.936364
## | | | | | | | | | Cholesterol <= 0.414286
## | | | | | | | | | | isSmoker <= 0
## | | | | | | | | | | | Age <= 0.512821: Low risk (2.0)
## | | | | | | | | | | | Age > 0.512821: Borderline risk (8.0)
## | | | | | | | | | | isSmoker > 0: Borderline risk (7.0)
## | | | | | | | | | Cholesterol > 0.414286
## | | | | | | | | | | Age <= 0.512821: Borderline risk (5.0)
## | | | | | | | | | | Age > 0.512821: Intermediate risk (3.0)
## | | | | | | | | Systolic > 0.936364: Intermediate risk (2.0)
## | | | | | | | Cholesterol > 0.814286: Low risk (3.0/1.0)
## | | | | | isDiabetic > 0
## | | | | | | isHypertensive <= 0
## | | | | | | | Systolic <= 0.609091
## | | | | | | | | isBlack <= 0
## | | | | | | | | | Age <= 0.333333: Low risk (2.0)
## | | | | | | | | | Age > 0.333333: Borderline risk (13.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | isSmoker <= 0: Borderline risk (4.0)
## | | | | | | | | | isSmoker > 0: Intermediate risk (3.0)
## | | | | | | | Systolic > 0.609091
## | | | | | | | | isBlack <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isBlack > 0: High risk (2.0)
## | | | | | | isHypertensive > 0
## | | | | | | | Systolic <= 0.827273
## | | | | | | | | isBlack <= 0: Intermediate risk (9.0)
## | | | | | | | | isBlack > 0
## | | | | | | | | | Cholesterol <= 0.814286: Intermediate risk (7.0)
## | | | | | | | | | Cholesterol > 0.814286: High risk (2.0)
## | | | | | | | Systolic > 0.827273: High risk (3.0)
## | | | | isMale > 0
## | | | | | Cholesterol <= 0.914286
## | | | | | | isSmoker <= 0
## | | | | | | | isDiabetic <= 0: Intermediate risk (18.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (6.0/1.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Age <= 0.435897: Borderline risk (4.0)
## | | | | | | | | | Age > 0.435897: Intermediate risk (2.0)
## | | | | | | isSmoker > 0
## | | | | | | | isDiabetic <= 0
## | | | | | | | | isHypertensive <= 0: Intermediate risk (7.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | Systolic <= 0.690909: Intermediate risk (7.0)
## | | | | | | | | | Systolic > 0.690909: High risk (4.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | Cholesterol <= 0.128571: Intermediate risk (4.0)
## | | | | | | | | Cholesterol > 0.128571: High risk (11.0/1.0)
## | | | | | Cholesterol > 0.914286
## | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## | Systolic <= 0.5
## | | isDiabetic <= 0
## | | | HDL <= 0.15
## | | | | Systolic <= 0.190909
## | | | | | isMale <= 0: Low risk (2.0)
## | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | Systolic > 0.190909: High risk (9.0)
## | | | HDL > 0.15
## | | | | Age <= 0.692308
## | | | | | isSmoker <= 0
## | | | | | | Systolic <= 0.172727: Low risk (4.0/1.0)
## | | | | | | Systolic > 0.172727
## | | | | | | | Age <= 0.589744: Intermediate risk (3.0/1.0)
## | | | | | | | Age > 0.589744: Borderline risk (22.0)
## | | | | | isSmoker > 0
## | | | | | | Age <= 0.589744: Borderline risk (3.0)
## | | | | | | Age > 0.589744: Intermediate risk (9.0/1.0)
## | | | | Age > 0.692308
## | | | | | HDL <= 0.975
## | | | | | | Systolic <= 0.427273
## | | | | | | | Cholesterol <= 0.057143
## | | | | | | | | Cholesterol <= 0.028571: Intermediate risk (5.0)
## | | | | | | | | Cholesterol > 0.028571: Borderline risk (4.0)
## | | | | | | | Cholesterol > 0.057143
## | | | | | | | | isSmoker <= 0: Intermediate risk (25.0/2.0)
## | | | | | | | | isSmoker > 0
## | | | | | | | | | Age <= 0.769231: Intermediate risk (4.0)
## | | | | | | | | | Age > 0.769231
## | | | | | | | | | | Systolic <= 0.072727: Intermediate risk (2.0)
## | | | | | | | | | | Systolic > 0.072727: High risk (4.0)
## | | | | | | Systolic > 0.427273: High risk (5.0)
## | | | | | HDL > 0.975: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Systolic <= 0.318182
## | | | | | Age <= 0.820513: Intermediate risk (18.0/1.0)
## | | | | | Age > 0.820513
## | | | | | | isHypertensive <= 0
## | | | | | | | Age <= 0.948718: Intermediate risk (4.0)
## | | | | | | | Age > 0.948718: High risk (4.0/1.0)
## | | | | | | isHypertensive > 0: High risk (3.0)
## | | | | Systolic > 0.318182: High risk (10.0/1.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.794872: Intermediate risk (4.0)
## | | | | | | Age > 0.794872: High risk (2.0)
## | | | | | isBlack > 0: High risk (4.0)
## | | | | isHypertensive > 0: High risk (28.0)
## | Systolic > 0.5
## | | Age <= 0.589744
## | | | isDiabetic <= 0: Borderline risk (4.0/1.0)
## | | | isDiabetic > 0: High risk (7.0/1.0)
## | | Age > 0.589744: High risk (141.0/7.0)
##
## Number of Leaves : 131
##
## Size of the tree : 261
# Plot the J48 decision tree
plot(C45Fit)
#
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 94 2 9 2
## Borderline risk 10 124 10 4
## Intermediate risk 12 0 66 23
## High risk 0 0 22 78
# Calculate performance metrics
accuracy_G2 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G2 <-( 1 - accuracy_G2)
sensitivity_G2 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G2 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G2 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_G2, "\n")
## Accuracy: 0.7938596
cat("Error Rate: ", error_rate_G2, "\n")
## Error Rate: 0.2061404
cat("Sensitivity (Recall): ", sensitivity_G2, "\n")
## Sensitivity (Recall): 0.78
cat("Specificity: ", specificity_G2, "\n")
## Specificity: 0.7977528
cat("Precision: ", precision_G2, "\n")
## Precision: 0.728972
Analysis:
The C4.5 decision tree, employing the gain ratio criterion, exhibits strong predictive accuracy with an impressive 79.39%. Characterized by 261 nodes and 131 leaves, the tree’s depth allows it to capture intricate patterns within the data. Notably, the model strikes a balance between sensitivity (78%) and specificity (79.78%), showcasing its ability to effectively identify positive and negative instances. With a precision of 72.90%, the model demonstrates accuracy in positive predictions.
3-partition the data into ( 80% training, 20% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# Define the formula
myFormula <- Risk ~ .
# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData)
# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##
## Low risk Borderline risk Intermediate risk High risk
## Low risk 317 0 7 1
## Borderline risk 3 304 6 0
## Intermediate risk 2 0 301 21
## High risk 0 1 10 299
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
##
## Age <= 0.564103
## | Age <= 0.333333
## | | HDL <= 0.25
## | | | Systolic <= 0.545455
## | | | | isSmoker <= 0
## | | | | | isDiabetic <= 0
## | | | | | | Systolic <= 0.309091: Low risk (14.0)
## | | | | | | Systolic > 0.309091
## | | | | | | | Age <= 0.076923: Low risk (5.0)
## | | | | | | | Age > 0.076923: Borderline risk (11.0)
## | | | | | isDiabetic > 0
## | | | | | | isMale <= 0: Intermediate risk (3.0)
## | | | | | | isMale > 0
## | | | | | | | Cholesterol <= 0.457143: Low risk (3.0/1.0)
## | | | | | | | Cholesterol > 0.457143
## | | | | | | | | isHypertensive <= 0: Borderline risk (5.0)
## | | | | | | | | isHypertensive > 0
## | | | | | | | | | isBlack <= 0: Borderline risk (4.0)
## | | | | | | | | | isBlack > 0: Intermediate risk (2.0)
## | | | | isSmoker > 0
## | | | | | Systolic <= 0.327273
## | | | | | | Age <= 0: Low risk (3.0/1.0)
## | | | | | | Age > 0
## | | | | | | | Age <= 0.205128
## | | | | | | | | HDL <= 0.0125: Intermediate risk (2.0)
## | | | | | | | | HDL > 0.0125
## | | | | | | | | | isDiabetic <= 0: Borderline risk (12.0)
## | | | | | | | | | isDiabetic > 0
## | | | | | | | | | | Systolic <= 0.1: Borderline risk (6.0)
## | | | | | | | | | | Systolic > 0.1: Intermediate risk (2.0)
## | | | | | | | Age > 0.205128: Intermediate risk (3.0/1.0)
## | | | | | Systolic > 0.327273
## | | | | | | Systolic <= 0.372727: Low risk (3.0/1.0)
## | | | | | | Systolic > 0.372727: Intermediate risk (9.0)
## | | | Systolic > 0.545455
## | | | | Systolic <= 0.618182
## | | | | | isDiabetic <= 0
## | | | | | | isHypertensive <= 0: Borderline risk (10.0)
## | | | | | | isHypertensive > 0: Intermediate risk (2.0/1.0)
## | | | | | isDiabetic > 0: High risk (6.0/1.0)
## | | | | Systolic > 0.618182
## | | | | | isSmoker <= 0
## | | | | | | Age <= 0.025641: Low risk (2.0)
## | | | | | | Age > 0.025641
## | | | | | | | isBlack <= 0: Intermediate risk (5.0)
## | | | | | | | isBlack > 0: High risk (4.0/1.0)
## | | | | | isSmoker > 0
## | | | | | | isBlack <= 0
## | | | | | | | isMale <= 0
## | | | | | | | | Cholesterol <= 0.485714: Intermediate risk (3.0)
## | | | | | | | | Cholesterol > 0.485714: High risk (2.0)
## | | | | | | | isMale > 0: High risk (4.0)
## | | | | | | isBlack > 0: High risk (14.0/1.0)
## | | HDL > 0.25
## | | | isBlack <= 0
## | | | | isSmoker <= 0
## | | | | | Age <= 0.230769: Low risk (82.0)
## | | | | | Age > 0.230769
## | | | | | | isDiabetic <= 0: Low risk (15.0)
## | | | | | | isDiabetic > 0
## | | | | | | | isMale <= 0: Low risk (6.0)
## | | | | | | | isMale > 0
## | | | | | | | | Systolic <= 0.145455: Low risk (2.0)
## | | | | | | | | Systolic > 0.145455: Borderline risk (14.0)
## | | | | isSmoker > 0
## | | | | | HDL <= 0.8125
## | | | | | | Systolic <= 0.309091
## | | | | | | | Age <= 0.179487: Low risk (20.0)
## | | | | | | | Age > 0.179487
## | | | | | | | | isDiabetic <= 0: Low risk (2.0)
## | | | | | | | | isDiabetic > 0: Borderline risk (6.0)
## | | | | | | Systolic > 0.309091
## | | | | | | | Cholesterol <= 0.228571
## | | | | | | | | isMale <= 0: Low risk (6.0)
## | | | | | | | | isMale > 0: Intermediate risk (3.0/1.0)
## | | | | | | | Cholesterol > 0.228571
## | | | | | | | | isMale <= 0
## | | | | | | | | | isHypertensive <= 0: Borderline risk (20.0/1.0)
## | | | | | | | | | isHypertensive > 0: Low risk (5.0)
## | | | | | | | | isMale > 0
## | | | | | | | | | HDL <= 0.55
## | | | | | | | | | | Cholesterol <= 0.871429: Intermediate risk (5.0)
## | | | | | | | | | | Cholesterol > 0.871429: Borderline risk (3.0)
## | | | | | | | | | HDL > 0.55: Borderline risk (27.0)
## | | | | | HDL > 0.8125
## | | | | | | Cholesterol <= 0.742857: Low risk (29.0)
## | | | | | | Cholesterol > 0.742857
## | | | | | | | isMale <= 0: Low risk (3.0)
## | | | | | | | isMale > 0: Intermediate risk (2.0)
## | | | isBlack > 0
## | | | | Systolic <= 0.536364
## | | | | | Cholesterol <= 0.828571
## | | | | | | isMale <= 0
## | | | | | | | isDiabetic <= 0
## | | | | | | | | Cholesterol <= 0.785714: Low risk (29.0)
## | | | | | | | | Cholesterol > 0.785714
## | | | | | | | | | Age <= 0.230769: Low risk (2.0)
## | | | | | | | | | Age > 0.230769: Borderline risk (2.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | Systolic <= 0.327273: Low risk (11.0)
## | | | | | | | | Systolic > 0.327273
## | | | | | | | | | Age <= 0.076923: Low risk (2.0)
## | | | | | | | | | Age > 0.076923: Intermediate risk (3.0)
## | | | | | | isMale > 0
## | | | | | | | Systolic <= 0.090909: Low risk (9.0)
## | | | | | | | Systolic > 0.090909
## | | | | | | | | isDiabetic <= 0
## | | | | | | | | | Age <= 0.282051
## | | | | | | | | | | Systolic <= 0.254545: Borderline risk (17.0/1.0)
## | | | | | | | | | | Systolic > 0.254545: Low risk (9.0/1.0)
## | | | | | | | | | Age > 0.282051: Intermediate risk (5.0)
## | | | | | | | | isDiabetic > 0: Intermediate risk (7.0)
## | | | | | Cholesterol > 0.828571
## | | | | | | isHypertensive <= 0: Borderline risk (13.0/1.0)
## | | | | | | isHypertensive > 0: Intermediate risk (4.0/1.0)
## | | | | Systolic > 0.536364
## | | | | | Age <= 0.102564
## | | | | | | HDL <= 0.625
## | | | | | | | Systolic <= 0.872727: Intermediate risk (9.0/1.0)
## | | | | | | | Systolic > 0.872727: High risk (5.0/1.0)
## | | | | | | HDL > 0.625
## | | | | | | | isDiabetic <= 0: Borderline risk (15.0/1.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | isSmoker <= 0: Intermediate risk (2.0)
## | | | | | | | | isSmoker > 0: Borderline risk (5.0)
## | | | | | Age > 0.102564
## | | | | | | isDiabetic <= 0
## | | | | | | | isHypertensive <= 0
## | | | | | | | | isMale <= 0
## | | | | | | | | | Cholesterol <= 0.642857: Low risk (9.0)
## | | | | | | | | | Cholesterol > 0.642857: Intermediate risk (2.0)
## | | | | | | | | isMale > 0: Intermediate risk (2.0)
## | | | | | | | isHypertensive > 0: Intermediate risk (8.0/1.0)
## | | | | | | isDiabetic > 0
## | | | | | | | isSmoker <= 0: Intermediate risk (8.0/1.0)
## | | | | | | | isSmoker > 0
## | | | | | | | | isMale <= 0
## | | | | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | | | | isHypertensive > 0: High risk (2.0)
## | | | | | | | | isMale > 0: High risk (6.0)
## | Age > 0.333333
## | | Systolic <= 0.254545
## | | | Cholesterol <= 0.828571
## | | | | HDL <= 0.825
## | | | | | isMale <= 0
## | | | | | | Systolic <= 0.090909
## | | | | | | | isDiabetic <= 0: Low risk (8.0)
## | | | | | | | isDiabetic > 0: Intermediate risk (2.0)
## | | | | | | Systolic > 0.090909
## | | | | | | | Cholesterol <= 0.228571
## | | | | | | | | Systolic <= 0.190909: Low risk (5.0)
## | | | | | | | | Systolic > 0.190909: Borderline risk (3.0)
## | | | | | | | Cholesterol > 0.228571
## | | | | | | | | Systolic <= 0.218182
## | | | | | | | | | HDL <= 0.475
## | | | | | | | | | | Age <= 0.410256: Borderline risk (5.0/1.0)
## | | | | | | | | | | Age > 0.410256: Intermediate risk (2.0)
## | | | | | | | | | HDL > 0.475: Borderline risk (20.0)
## | | | | | | | | Systolic > 0.218182: Low risk (3.0/1.0)
## | | | | | isMale > 0
## | | | | | | isHypertensive <= 0
## | | | | | | | HDL <= 0.2125
## | | | | | | | | Cholesterol <= 0.8: Intermediate risk (6.0)
## | | | | | | | | Cholesterol > 0.8: Borderline risk (3.0)
## | | | | | | | HDL > 0.2125
## | | | | | | | | isDiabetic <= 0
## | | | | | | | | | HDL <= 0.2375: Borderline risk (5.0)
## | | | | | | | | | HDL > 0.2375: Low risk (4.0)
## | | | | | | | | isDiabetic > 0: Borderline risk (9.0/1.0)
## | | | | | | isHypertensive > 0: Intermediate risk (8.0/1.0)
## | | | | HDL > 0.825
## | | | | | Age <= 0.461538: Low risk (9.0)
## | | | | | Age > 0.461538: Intermediate risk (2.0)
## | | | Cholesterol > 0.828571
## | | | | isDiabetic <= 0
## | | | | | Age <= 0.461538: Low risk (3.0)
## | | | | | Age > 0.461538: Intermediate risk (2.0)
## | | | | isDiabetic > 0: Intermediate risk (9.0)
## | | Systolic > 0.254545
## | | | HDL <= 0.2
## | | | | isSmoker <= 0
## | | | | | isMale <= 0: Intermediate risk (12.0/1.0)
## | | | | | isMale > 0
## | | | | | | isDiabetic <= 0: Intermediate risk (7.0/1.0)
## | | | | | | isDiabetic > 0: High risk (4.0)
## | | | | isSmoker > 0
## | | | | | Systolic <= 0.354545: Intermediate risk (3.0)
## | | | | | Systolic > 0.354545: High risk (13.0/2.0)
## | | | HDL > 0.2
## | | | | isMale <= 0
## | | | | | Cholesterol <= 0.814286
## | | | | | | isHypertensive <= 0
## | | | | | | | HDL <= 0.95
## | | | | | | | | Cholesterol <= 0.414286
## | | | | | | | | | isBlack <= 0: Borderline risk (22.0)
## | | | | | | | | | isBlack > 0
## | | | | | | | | | | Cholesterol <= 0.328571
## | | | | | | | | | | | Age <= 0.435897: Intermediate risk (2.0)
## | | | | | | | | | | | Age > 0.435897: Low risk (2.0/1.0)
## | | | | | | | | | | Cholesterol > 0.328571: Borderline risk (4.0)
## | | | | | | | | Cholesterol > 0.414286
## | | | | | | | | | Systolic <= 0.709091
## | | | | | | | | | | Age <= 0.512821: Borderline risk (9.0/1.0)
## | | | | | | | | | | Age > 0.512821: Intermediate risk (2.0)
## | | | | | | | | | Systolic > 0.709091: Intermediate risk (6.0)
## | | | | | | | HDL > 0.95: Low risk (2.0)
## | | | | | | isHypertensive > 0
## | | | | | | | isDiabetic <= 0
## | | | | | | | | Cholesterol <= 0.214286: Borderline risk (7.0)
## | | | | | | | | Cholesterol > 0.214286
## | | | | | | | | | HDL <= 0.55: Intermediate risk (5.0)
## | | | | | | | | | HDL > 0.55: Low risk (4.0)
## | | | | | | | isDiabetic > 0
## | | | | | | | | Systolic <= 0.827273: Intermediate risk (12.0)
## | | | | | | | | Systolic > 0.827273: High risk (2.0)
## | | | | | Cholesterol > 0.814286
## | | | | | | Age <= 0.410256: Low risk (6.0/1.0)
## | | | | | | Age > 0.410256
## | | | | | | | Systolic <= 0.581818: Intermediate risk (3.0)
## | | | | | | | Systolic > 0.581818: High risk (4.0)
## | | | | isMale > 0
## | | | | | Cholesterol <= 0.928571
## | | | | | | isDiabetic <= 0: Intermediate risk (34.0/3.0)
## | | | | | | isDiabetic > 0
## | | | | | | | HDL <= 0.6875: High risk (10.0/1.0)
## | | | | | | | HDL > 0.6875
## | | | | | | | | isSmoker <= 0
## | | | | | | | | | Cholesterol <= 0.257143
## | | | | | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | | | | | isHypertensive > 0: Borderline risk (4.0)
## | | | | | | | | | Cholesterol > 0.257143: Intermediate risk (6.0)
## | | | | | | | | isSmoker > 0
## | | | | | | | | | Cholesterol <= 0.314286: Intermediate risk (3.0)
## | | | | | | | | | Cholesterol > 0.314286: High risk (3.0)
## | | | | | Cholesterol > 0.928571
## | | | | | | isHypertensive <= 0: Intermediate risk (2.0)
## | | | | | | isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## | Systolic <= 0.490909
## | | isDiabetic <= 0
## | | | Age <= 0.692308
## | | | | HDL <= 0.1125
## | | | | | isHypertensive <= 0: Low risk (2.0)
## | | | | | isHypertensive > 0: High risk (3.0)
## | | | | HDL > 0.1125
## | | | | | isSmoker <= 0
## | | | | | | Systolic <= 0.181818: Low risk (6.0/1.0)
## | | | | | | Systolic > 0.181818
## | | | | | | | Age <= 0.589744: Intermediate risk (3.0/1.0)
## | | | | | | | Age > 0.589744: Borderline risk (27.0)
## | | | | | isSmoker > 0
## | | | | | | Systolic <= 0.081818: Borderline risk (4.0)
## | | | | | | Systolic > 0.081818: Intermediate risk (11.0/1.0)
## | | | Age > 0.692308
## | | | | HDL <= 0.975
## | | | | | Age <= 0.769231
## | | | | | | Cholesterol <= 0.057143: Borderline risk (5.0/1.0)
## | | | | | | Cholesterol > 0.057143: Intermediate risk (14.0)
## | | | | | Age > 0.769231
## | | | | | | isSmoker <= 0
## | | | | | | | Systolic <= 0.427273: Intermediate risk (25.0/3.0)
## | | | | | | | Systolic > 0.427273: High risk (3.0)
## | | | | | | isSmoker > 0
## | | | | | | | Systolic <= 0.227273
## | | | | | | | | isBlack <= 0: Intermediate risk (3.0)
## | | | | | | | | isBlack > 0: High risk (3.0/1.0)
## | | | | | | | Systolic > 0.227273: High risk (10.0)
## | | | | HDL > 0.975: Borderline risk (5.0)
## | | isDiabetic > 0
## | | | isSmoker <= 0
## | | | | Age <= 0.666667: Intermediate risk (16.0/1.0)
## | | | | Age > 0.666667
## | | | | | isMale <= 0
## | | | | | | Age <= 0.923077
## | | | | | | | HDL <= 0.6: Intermediate risk (10.0/1.0)
## | | | | | | | HDL > 0.6: High risk (5.0/1.0)
## | | | | | | Age > 0.923077: High risk (4.0)
## | | | | | isMale > 0
## | | | | | | isBlack <= 0: High risk (4.0)
## | | | | | | isBlack > 0
## | | | | | | | Age <= 0.769231: High risk (4.0)
## | | | | | | | Age > 0.769231: Intermediate risk (4.0/1.0)
## | | | isSmoker > 0
## | | | | isHypertensive <= 0
## | | | | | isBlack <= 0
## | | | | | | Age <= 0.692308: Intermediate risk (5.0)
## | | | | | | Age > 0.692308: High risk (4.0)
## | | | | | isBlack > 0: High risk (6.0)
## | | | | isHypertensive > 0: High risk (32.0)
## | Systolic > 0.490909
## | | Age <= 0.589744
## | | | isDiabetic <= 0: Borderline risk (4.0/1.0)
## | | | isDiabetic > 0: High risk (7.0/1.0)
## | | Age > 0.589744
## | | | isSmoker <= 0
## | | | | isMale <= 0
## | | | | | Age <= 0.666667: Intermediate risk (5.0/1.0)
## | | | | | Age > 0.666667
## | | | | | | isDiabetic <= 0
## | | | | | | | Systolic <= 0.809091: Intermediate risk (5.0/1.0)
## | | | | | | | Systolic > 0.809091: High risk (7.0)
## | | | | | | isDiabetic > 0: High risk (16.0)
## | | | | isMale > 0: High risk (37.0/1.0)
## | | | isSmoker > 0: High risk (86.0)
##
## Number of Leaves : 153
##
## Size of the tree : 305
# Plot the J48 decision tree
plot(C45Fit)
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 61 0 4 0
## Borderline risk 8 91 5 3
## Intermediate risk 6 0 50 19
## High risk 0 1 14 54
# Calculate performance metrics
accuracy_G3 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G3 <-( 1 - accuracy_G3)
sensitivity_G3 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G3 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G3 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
accuracy <- sum(testPred == testData$Risk) / length(testPred)
# Display performance metrics
cat("Accuracy: ", accuracy_G3, "\n")
## Accuracy: 0.8101266
cat("Error Rate: ", error_rate_G3, "\n")
## Error Rate: 0.1898734
cat("Sensitivity (Recall): ", sensitivity_G3, "\n")
## Sensitivity (Recall): 0.7826087
cat("Specificity: ", specificity_G3, "\n")
## Specificity: 0.8178138
cat("Precision: ", precision_G3, "\n")
## Precision: 0.7105263
Analysis:
The C4.5 decision tree, employing the gain ratio criterion, demonstrates a commendable accuracy of 81.01%. With a substantial tree size of 305 and 153 leaves, the model captures nuanced relationships within the dataset. Its predictive prowess is evident in the balanced sensitivity (78.26%) and specificity (81.78%), highlighting its ability to correctly identify both positive and negative instances. The precision of 71.05% emphasizes the accuracy of positive predictions. This collectively positions the C4.5 decision tree as a robust and effective choice for classification on our dataset, showcasing its capability to achieve high accuracy and reliable predictions.
# Create data frames for each model's summary
summary_c4.5_1 <- data.frame(
Model = "60% training, 40% testing",
Accuracy = 78.38,
Sensitivity = 73.72,
Specificity = 79.92,
Precision = 74.19
)
summary_c4.5_2 <- data.frame(
Model = "70% training, 30% testing",
Accuracy = 79.39,
Sensitivity = 78.0,
Specificity = 79.78,
Precision = 72.90
)
summary_c4.5_3 <- data.frame(
Model = "80% training, 20% testing",
Accuracy = 81.01,
Sensitivity = 78.26,
Specificity = 81.78,
Precision = 71.05
)
# Combine the summaries into a single data frame
comparison_table <- rbind(summary_c4.5_1, summary_c4.5_2, summary_c4.5_3)
# Print the comparison table
print(comparison_table)
## Model Accuracy Sensitivity Specificity Precision
## 1 60% training, 40% testing 78.38 73.72 79.92 74.19
## 2 70% training, 30% testing 79.39 78.00 79.78 72.90
## 3 80% training, 20% testing 81.01 78.26 81.78 71.05
In our exploration of decision tree models—specifically, C4.5 with varying numbers of training-testing —we aimed to identify the optimal configuration for accurate and reliable predictions. The results indicate that the model with (80% training, 20% testing) stands out, achieving the highest accuracy at 81.01%. This particular configuration strikes a balance between sensitivity (78.26%), specificity (81.78%), and precision (71.05%), making it a robust choice for the classification task at hand.
It’s noteworthy that the model with (70% training, 30% testing) also performs well, showcasing competitive accuracy (79.39%) and a balanced trade-off between sensitivity and specificity. However, the model with (60% training, 40% testing) surpasses it, demonstrating superior sensitivity and precision.
In contrast, the model with (60% training, 40% testing), while achieving a respectable accuracy of 78.38%, exhibits slightly lower sensitivity and precision. This suggests that a more complex tree structure, as seen in the model with (80% training, 20% testing), contributes to better capturing the underlying patterns in the data.
In conclusion, the C4.5 decision tree with (80% training, 20% testing) emerges as the preferred model for this specific dataset and classification task. Its superior performance in terms of accuracy, sensitivity, specificity, and precision underscores its suitability for making reliable predictions.
-For the construction of our decision tree model, we have opted for the C5.0 algorithm, a sophisticated and versatile tool known for its proficiency in handling classification tasks. Specifically, we harness the power of information gain as the guiding criterion within C5.0. This choice is deliberate, as information gain allows the algorithm to discern the most relevant and discriminative features in our dataset, facilitating the creation of a decision tree that excels at capturing intricate patterns and relationships.
1-partition the data into ( 60% training, 40% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
dim(trainData)
## [1] 959 10
dim(testData)
## [1] 629 10
# install.packages("C50")
library(C50)
# Define the formula
myFormula <- Risk ~ .
# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)
# Plot the decision tree
plot(c50_model)
# Display a summary of the decision tree
print(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
## Classification Tree
## Number of samples: 959
## Number of predictors: 9
##
## Tree size: 105
##
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 125 4 5 1
## Borderline risk 19 165 28 12
## Intermediate risk 6 0 85 26
## High risk 1 7 29 116
# Calculate performance metrics
accuracy_I1 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I1 <-( 1 - accuracy_I1)
sensitivity_I1 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I1 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I1 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_I1, "\n")
## Accuracy: 0.7806041
cat("Error Rate: ", error_rate_I1, "\n")
## Error Rate: 0.2193959
cat("Sensitivity (Recall): ", sensitivity_I1, "\n")
## Sensitivity (Recall): 0.7581699
cat("Specificity: ", specificity_I1, "\n")
## Specificity: 0.7878151
cat("Precision: ", precision_I1, "\n")
## Precision: 0.7483871
Analysis: The C5 model demonstrates strong predictive capabilities with an accuracy of 78.37%. It effectively identifies instances of low risk (sensitivity of 80.6%) and maintains high specificity (77.6%) in recognizing non-low-risk instances. The precision of 72.26% highlights the accuracy of positive predictions. The model’s tree structure, comprising 120 nodes, reflects its complexity in capturing patterns within the data. These results suggest a well-balanced model with the potential for reliable classification across multiple risk categories.
2-partition the data into ( 70% training, 30% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)
# Define the formula
myFormula <- Risk ~ .
# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)
# Plot the decision tree
plot(c50_model)
# Display a summary of the decision tree
print(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
## Classification Tree
## Number of samples: 1132
## Number of predictors: 9
##
## Tree size: 135
##
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 97 2 9 2
## Borderline risk 7 124 11 4
## Intermediate risk 12 0 64 22
## High risk 0 0 23 79
# Calculate performance metrics
accuracy_I2 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I2 <-( 1 - accuracy_I2)
sensitivity_I2 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I2 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I2 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_I2, "\n")
## Accuracy: 0.7982456
cat("Error Rate: ", error_rate_I2, "\n")
## Error Rate: 0.2017544
cat("Sensitivity (Recall): ", sensitivity_I2, "\n")
## Sensitivity (Recall): 0.7745098
cat("Specificity: ", specificity_I2, "\n")
## Specificity: 0.8050847
cat("Precision: ", precision_I2, "\n")
## Precision: 0.7383178
Analysis: The C5 model achieved an accuracy of 78.07%, demonstrating its proficiency in making correct predictions across all classes. It exhibits robust sensitivity (77.23%), effectively identifying instances of high risk. The model’s specificity (78.31%) suggests improved accuracy in correctly identifying non-high-risk instances compared to the previous configuration. The precision of 72.90% reflects the accuracy of positive predictions. The tree structure, comprising 125 nodes, signifies a moderate level of complexity. Overall, the model performs well, with enhanced specificity, showcasing its suitability for this classification task.
3-partition the data into ( 80% training, 20% testing):sting):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)
# Define the formula
myFormula <- Risk ~ .
# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)
# Plot the decision tree
plot(c50_model)
# Display a summary of the decision tree
print(c50_model)
##
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
##
## Classification Tree
## Number of samples: 1272
## Number of predictors: 9
##
## Tree size: 155
##
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)
# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## testPred Low risk Borderline risk Intermediate risk High risk
## Low risk 61 0 4 0
## Borderline risk 8 90 4 3
## Intermediate risk 6 1 52 16
## High risk 0 1 13 57
# Calculate performance metrics
accuracy_I3 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I3 <-( 1 - accuracy_I3)
sensitivity_I3 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I3 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I3 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])
# Display performance metrics
cat("Accuracy: ", accuracy_I3, "\n")
## Accuracy: 0.8227848
cat("Error Rate: ", error_rate_I3, "\n")
## Error Rate: 0.1772152
cat("Sensitivity (Recall): ", sensitivity_I3, "\n")
## Sensitivity (Recall): 0.8028169
cat("Specificity: ", specificity_I3, "\n")
## Specificity: 0.8285714
cat("Precision: ", precision_I3, "\n")
## Precision: 0.75
Analysis: The C5 model achieved an accuracy of 64.42%, showcasing its ability to make correct predictions across all classes. It exhibits strong sensitivity (73.03%), effectively identifying instances of high risk. However, the model’s specificity (57.98%) suggests potential for improvement in correctly identifying non-high-risk instances. The precision of 84.42% reflects the accuracy of positive predictions. The tree structure, comprising 92 nodes, indicates a moderate level of complexity. While the model performs reasonably well, there may be opportunities for refinement, particularly in specificity. Overall,The model’s strength lies in identifying clear cases (Low and High risk) .
# Create data frames for each model's summary
summary1 <- data.frame(
Model = "60%training 40%testing",
Accuracy = 78.37,
Sensitivity = 80.6,
Specificity = 77.6,
Precision = 72.26
)
summary2 <- data.frame(
Model = "70%training 30%testing",
Accuracy = 78.07,
Sensitivity = 77.23,
Specificity = 78.31,
Precision = 72.90
)
summary3 <- data.frame(
Model = "80%training 20%testing",
Accuracy = 64.42,
Sensitivity = 73.03,
Specificity = 57.98,
Precision = 84.42
)
# Combine the summaries into a single data frame
comparison_table <- rbind(summary1, summary2, summary3)
# Print the comparison table
print(comparison_table)
## Model Accuracy Sensitivity Specificity Precision
## 1 60%training 40%testing 78.37 80.60 77.60 72.26
## 2 70%training 30%testing 78.07 77.23 78.31 72.90
## 3 80%training 20%testing 64.42 73.03 57.98 84.42
Analysis:
Conclusion:
Opting for RPART with the Gini index involves building a decision tree that maximizes class separation by minimizing impurity. This method, rooted in recursive partitioning, aims to create nodes that group similar instances based on the Gini impurity criterion.
1-partition the data into ( 60% training, 40% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
dim(trainData)
## [1] 959 10
dim(testData)
## [1] 629 10
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
library(caret)
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)
# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")
# Create a confusion matrix
conf_matrix_rpart <- table(test_pred, testData$Risk)
# Display the confusion matrix
print(conf_matrix_rpart)
##
## test_pred Low risk Borderline risk Intermediate risk High risk
## Low risk 107 48 16 4
## Borderline risk 31 71 28 4
## Intermediate risk 11 53 65 36
## High risk 2 4 38 111
# Calculate performance metrics
accuracy_D1 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D1 <- 1 - accuracy_D1
sensitivity_D1 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D1 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D1 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
# Display performance metrics
cat("Accuracy: ", accuracy_D1, "\n")
## Accuracy: 0.8227848
cat("Error Rate: ", error_rate_D1, "\n")
## Error Rate: 0.1772152
cat("Sensitivity (Recall): ", sensitivity_D1, "\n")
## Sensitivity (Recall): 0.8571429
cat("Specificity: ", specificity_D1, "\n")
## Specificity: 0.8056872
cat("Precision: ", precision_D1, "\n")
## Precision: 0.9782609
Analysis:
The results obtained from the rpart model showcase a balanced performance across various risk categories. The model achieved an overall accuracy of 57.39%, indicating its ability to make correct predictions across all classes. Sensitivity, measuring the model’s capability to identify positive instances, is at 50.81%, demonstrating a reasonable ability to detect true positives. Specificity stands at 60.14%, indicating the model’s proficiency in correctly identifying negative instances. The precision of 53.41% signifies the accuracy of positive predictions.
2-partition the data into ( 70% training, 30% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)
# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")
# Create a confusion matrix
conf_matrix <- table(test_pred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## test_pred Low risk Borderline risk Intermediate risk High risk
## Low risk 75 12 15 3
## Borderline risk 34 81 21 1
## Intermediate risk 5 31 54 37
## High risk 2 2 17 66
# Calculate performance metrics
accuracy_D2 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D2 <- 1 - accuracy_D2
sensitivity_D2 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D2 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D2 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
# Display performance metrics
cat("Accuracy: ", accuracy_D2, "\n")
## Accuracy: 0.6052632
cat("Error Rate: ", error_rate_D2, "\n")
## Error Rate: 0.3947368
cat("Sensitivity (Recall): ", sensitivity_D2, "\n")
## Sensitivity (Recall): 0.5912409
cat("Specificity: ", specificity_D2, "\n")
## Specificity: 0.6112853
cat("Precision: ", precision_D2, "\n")
## Precision: 0.6428571
Analysis:
The results from the RPART model reveal a well-balanced performance across different risk categories. The model achieved an overall accuracy of 60.31%, indicating its proficiency in making accurate predictions across all classes. Notably, it demonstrated a sensitivity of 55.10%, effectively identifying instances of low risk, and a specificity of 62.78%, accurately recognizing non-low-risk instances. The precision of 64.29% underscores the model’s accuracy in positive predictions.
3-partition the data into ( 80% training, 20% testing):
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree)
# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")
# Create a confusion matrix
conf_matrix <- table(test_pred, testData$Risk)
# Display the confusion matrix
print(conf_matrix)
##
## test_pred Low risk Borderline risk Intermediate risk High risk
## Low risk 50 28 9 3
## Borderline risk 19 41 16 0
## Intermediate risk 4 22 36 28
## High risk 2 1 12 45
# Calculate performance metrics
accuracy_D3 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D3 <- 1 - accuracy_D3
sensitivity_D3 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D3 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D3 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
# Display performance metrics
cat("Accuracy: ", accuracy_D3, "\n")
## Accuracy: 0.5443038
cat("Error Rate: ", error_rate_D3, "\n")
## Error Rate: 0.4556962
cat("Sensitivity (Recall): ", sensitivity_D3, "\n")
## Sensitivity (Recall): 0.5394737
cat("Specificity: ", specificity_D3, "\n")
## Specificity: 0.5458333
cat("Precision: ", precision_D3, "\n")
## Precision: 0.4456522
Analysis:
The outcomes of the RPART model showcase a discernible performance across distinct risk categories. The model achieved an overall accuracy of 54.75%, highlighting its capability to make correct predictions across all classes. Specifically, it demonstrated a sensitivity of 51.02%, effectively identifying instances of low risk, and a specificity of 56.42%, accurately recognizing non-low-risk instances. The precision of 54.35% emphasizes the model’s accuracy in positive predictions.
# Create data frames for each summary
summary1 <- data.frame(
Model = "60% training 40% testing",
Accuracy = 57.39,
Sensitivity = 50.81,
Specificity = 60.14,
Precision = 53.41
)
summary2 <- data.frame(
Model = "70% training, 30% testing",
Accuracy = 60.31,
Sensitivity = 55.10,
Specificity = 62.78,
Precision = 64.29
)
summary3 <- data.frame(
Model = " 80% training 20% testing",
Accuracy = 54.75,
Sensitivity = 51.02,
Specificity = 56.42,
Precision = 54.35
)
# Combine summaries into a single data frame
comparison_table <- rbind(summary1, summary2, summary3)
# Print the comparison table
print(comparison_table)
## Model Accuracy Sensitivity Specificity Precision
## 1 60% training 40% testing 57.39 50.81 60.14 53.41
## 2 70% training, 30% testing 60.31 55.10 62.78 64.29
## 3 80% training 20% testing 54.75 51.02 56.42 54.35
Observations:
The model trained with 70% of the data for training and 30% for testing exhibits the highest overall performance with the highest accuracy, sensitivity, specificity, and precision.
The 60% training and 40% testing model follows closely with competitive metrics across all categories.
The 80% training and 20% testing model lags behind in accuracy and precision but maintains moderate performance in sensitivity and specificity.
Conclusion: Considering the three models, the 70% training and 30% testing model stands out as the most effective, striking a balance between accuracy, sensitivity, specificity, and precision. It outperforms the other two models, demonstrating its robustness in handling different proportions of training and testing data
the C4.5 model using Gain ratio emerged as the preferred choice. The C4.5 model exhibited superior predictive performance with a higher accuracy of 81.78% in the (80% training, 20% testing) partitioning , sensitivity, specificity, and precision compared to the other models. The decision to favor C4.5 is grounded in its ability to capture both positive and negative instances effectively, making it well-suited for the dataset characteristics. The model’s strength lies in identifying clear cases (Low and High risk).
Clustering models are utilized to group data into distinct clusters or groups. In our case, we will apply the k-means clustering algorithm to our dataset and interpret the results, taking into consideration our knowledge of the class label.
Certain factors can impact the efficacy of the final clusters formed when using k-means clustering that we have to be aware. For instance, outliers: Cluster formation is very sensitive to the presence of outliers as that they can pull the cluster towards itself, thus affecting optimal cluster formation. However, we have already addressed this concern in earlier steps.
cdataset = subset(dataset, select = -c(Risk))
We can now use the rest of the attributes for clustering.
The checking is because K-Means algorithm does not work with categorical data.
# 1- view
str(cdataset)
## 'data.frame': 1000 obs. of 9 variables:
## $ isMale : int 1 0 0 1 0 0 1 1 0 1 ...
## $ isBlack : int 1 0 1 1 0 0 0 0 0 0 ...
## $ isSmoker : int 0 0 1 1 1 1 1 1 1 0 ...
## $ isDiabetic : int 1 1 1 1 0 0 0 1 0 1 ...
## $ isHypertensive: int 1 1 1 0 1 1 0 0 1 1 ...
## $ Age : num 0.2308 0.7436 0.2564 0.0513 0.6667 ...
## $ Systolic : num 0.1 0.7 0.827 0.5 0.4 ...
## $ Cholesterol : num 0.729 0.357 0.243 0.514 0.986 ...
## $ HDL : num 0.15 0.487 0.487 0.325 0.537 ...
It’s clear that all 9 variables are numeric of type integer so we can start working on it with no issues.
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
cdataset <- scale(cdataset)
fviz_nbclust(cdataset, kmeans, method = "silhouette")+ labs(subtitle = "silhouette method")
According to silhouette method best number of clusters is K = 2 so will test it along with other high points such as k=4 , k=8.
# 2- prepreocessing
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)
K-means algorithm is non-deterministic, meaning that the clustering outcome can vary each time the algorithm is executed, even when applied to the same dataset. To address this, we will set a seed for the random number generation, ensuring that the results can be reproduced consistently.
# 3- run k-means clustering to find 2 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(cdataset,2)
# print the clusterng result
kmeans.result
## K-means clustering with 2 clusters of sizes 516, 484
##
## Cluster means:
## isMale isBlack isSmoker isDiabetic isHypertensive Age
## 1 -0.02262886 -0.04843516 0.9680116 0.02577952 -0.001627174 -0.02976937
## 2 0.02412499 0.05163749 -1.0320124 -0.02748395 0.001734756 0.03173759
## Systolic Cholesterol HDL
## 1 0.04730009 -0.01946460 -0.007645875
## 2 -0.05042737 0.02075152 0.008151387
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2 2 1 1 1 1 1 1 1 2 1 2 2 1 2 1
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 2 1 1 2 1 2 1 1 2 2 2 2 1 2 2 2
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2 1 1 2 1 2 1 2 1 1 2 2 2 2 1 1
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 2 1 1 1 2 1 1 2 1 2 2 2 1 1 1 2
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 2 2 1 1 2 1 1 1 1 2 2 2 2 1 1
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 1 1 2 2 2 2 2 2 1 2 1 1 2 1 1 1
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 2 2 1 1 2 1 2 1 1 2 2 2 2 1 1 1
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 1 1 2 2 2 2 1 1 1 2 1 2 2 1 2 2
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 1 1 2 2 1 1 2 2 1 1 2 1 2 2 1 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 1 2 2 1 1 1 1 2 1 1 2 2 2 1
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 2 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 1 2 2 2 2 2 2 2 1 2 2 2 1 1 2 2
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 1 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 2 2 1 1 2 2 2 2 2 1 1 2 1 1
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 1 2 1 2 2 2 2 1 2 2 2 2 1 1 2 1
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 1 1 2 1 2 1 2 2 2 2 1 2 2 1 1 2
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 2 2 2 1 1 1 1 1 1 2 1 2 2 1 1 2
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 2 2 1 2 1 1 1 1 2 2 2 2 1 1 2 1
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 1 2 1 2 2 1 1 1 2 2 1 2 2 1 2 2
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 1 1 2 2 2 2 2 2 1 1 2 2 2 1 1 2
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 1
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 1
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
## 1 2 1 1 2 2 1 1 1 1 2 2 2 2 1 1
## 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 1 1 1 1 2 2 1 2 2 1 2 2 2 1 1 2
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## 1 2 1 2 2 1 1 1 1 2 1 2 1 1 2 2
## 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 1 1 2 1 2 1 2 1 2 1 1 2 1 1 1 2
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## 1 2 2 1 1 1 2 1 1 2 1 2 1 1 2 1
## 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512
## 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528
## 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 2
## 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544
## 2 2 1 2 1 2 2 1 2 1 1 1 1 1 2 1
## 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560
## 2 1 2 2 2 2 2 1 1 2 2 1 1 1 1 2
## 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
## 1 1 2 1 2 1 1 2 2 1 1 2 2 2 2 2
## 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 1 1 2 1 1 2 1 2 2 2 2 1 2 2 1 1
## 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608
## 2 2 2 1 1 2 2 1 1 1 1 1 2 1 1 1
## 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
## 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 2
## 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640
## 2 1 1 2 2 2 1 2 2 1 2 1 2 2 2 1
## 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656
## 1 1 2 2 2 2 1 2 1 2 2 2 1 1 2 1
## 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672
## 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 2
## 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688
## 2 2 2 2 1 1 2 1 1 2 1 2 2 1 1 2
## 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704
## 1 1 2 1 1 2 1 1 2 1 2 1 2 1 2 2
## 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720
## 1 2 2 2 2 2 1 1 2 1 2 1 1 2 2 1
## 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736
## 1 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1
## 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752
## 1 2 2 2 1 1 1 2 1 2 1 2 2 1 2 1
## 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768
## 2 2 1 1 1 1 1 1 1 1 2 2 1 1 2 1
## 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784
## 2 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2
## 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800
## 2 1 2 1 1 1 1 2 1 2 2 2 1 1 2 1
## 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816
## 1 2 1 2 2 1 1 2 1 2 2 2 1 2 1 1
## 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832
## 2 1 1 2 1 1 2 1 1 2 2 2 1 1 2 2
## 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 1 2 2 1 1 1 2 1 1 2 1 1 2 1 2 2
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864
## 1 2 1 2 2 2 2 1 2 1 2 1 2 2 1 1
## 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880
## 2 2 2 1 1 1 2 1 1 1 2 2 1 1 1 2
## 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896
## 1 2 2 1 2 2 2 1 1 2 1 1 2 2 1 1
## 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912
## 1 1 2 1 1 2 2 1 2 2 2 1 1 2 1 1
## 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
## 1 2 2 1 1 1 2 1 2 1 1 1 2 1 1 2
## 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944
## 2 1 1 1 2 2 2 1 1 1 1 2 1 1 2 2
## 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960
## 2 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1
## 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976
## 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 2
## 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992
## 1 2 1 2 1 2 2 2 1 2 2 2 1 2 1 1
## 993 994 995 996 997 998 999 1000
## 2 2 2 1 2 1 1 2
##
## Within cluster sum of squares by cluster:
## [1] 4105.816 3878.629
## (between_SS / total_SS = 11.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
k-means algorithm is used to identify and assign the data to two clusters so that each observation will be assigned to one of the two clusters. From the output, we can observe that two different clusters have been found with sizes 516 and 484, and the within cluster sum of square (WCSS) =11.2% meaning the clusters are kind of compacted. But we need to visualize it to have a better look.
Cluster Plot:
# 4- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeans.result, data = cdataset)
The plot shows overlapping clusters, particularly in the middle, suggesting that some data points are challenging to assign to a specific cluster. An avegrage silhouette coefficient can be more precise so we will calculate it.
The value is between [-1, 1], a score of 1 denotes the best. And the worst value is -1. Values near 0 denote overlapping clusters.
#Average silhouette
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
## cluster size ave.sil.width
## 1 1 516 0.11
## 2 2 484 0.11
The Average Silhouette Coefficient of 0.11 suggests that there is a certain level of similarity among the data points within the clusters formed through the clustering process. However, the coefficient is relatively low, approaching zero, indicating the presence of overlapping clusters.
To measure the quality of the cluster the average BCubed precision and recall of all objects in the data set is computed:
# Cluster assignments and ground truth labels
cluster_assignments <- kmeans.result$cluster
ground_truth <- dataset$Risk
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
n <- length(cluster_assignments)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- cluster_assignments[i]
label <- ground_truth[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(cluster_assignments == cluster)
# Count the total number of items with the same category
total_same_category <- sum(ground_truth == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
precision <- precision_sum / n # Calculate average precision
recall <- recall_sum / n # Calculate average recall
return(list(precision = precision, recall = recall)) }
# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)
# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall
# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
## BCubed Precision: 0.3299589
## BCubed Recall: 0.5317886
The calculated precision value is 0.32996 not a high value. It means that the clusters are not pure; meaning not all data points in a cluster belong to the same category.
On the other hand, the calculated recall value of 0.53179 implies that approximately half of the objrcts belonging to the same categore are correctly assigned to the same cluster.
Conclusion of K=2:
Considering upove results for K=2 in isolation, without considering our knowledge of the class label, it is evident that the performance is suboptimal (less than ideal). Therefore, it is recommended to explore other values for K in order to achieve better clustering results.
# 2- prepreocessing
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)
# 1- run k-means clustering to find 4 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans_result <- kmeans(cdataset, centers = 4, nstart = 25)
#Accessing kmeans_result
print(kmeans_result)
## K-means clustering with 4 clusters of sizes 240, 255, 244, 261
##
## Cluster means:
## isMale isBlack isSmoker isDiabetic isHypertensive Age
## 1 -0.004998499 0.098461545 -1.0320124 -0.002334427 1.00954535 0.04092810
## 2 -0.101538140 -1.061382078 0.9680116 0.124685876 0.02175491 -0.01063463
## 3 0.052771040 0.005581038 -1.0320124 -0.052221191 -0.98955436 0.02269775
## 4 0.054466405 0.941225616 0.9680116 -0.070853124 -0.02447174 -0.04846424
## Systolic Cholesterol HDL
## 1 -0.03065348 -0.08081696 -0.004490818
## 2 0.08003760 0.02566201 -0.052055040
## 3 -0.06987709 0.12065493 0.020586343
## 4 0.01531517 -0.06355382 0.035742390
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 1 4 4 2 2 2 2 2 1 2 1 1 4 1 2
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 1 4 4 1 2 3 2 4 1 1 1 1 2 3 1 1
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 1 4 4 1 2 1 2 3 4 2 3 3 3 3 4 2
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 3 2 4 4 3 4 2 3 4 3 3 1 4 2 2 1
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 4 3 1 4 4 3 2 4 2 4 3 1 3 3 4 4
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 4 4 1 1 1 3 3 3 4 1 4 4 3 2 4 2
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 1 1 2 2 1 2 1 4 4 3 1 3 3 2 2 2
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 4 2 1 3 3 1 4 2 4 3 4 1 3 4 3 3
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 4 3 1 4 4 1 3 2 2 1 2 3 1 2 4
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 1 3 2 3 3 2 2 4 4 3 4 2 3 1 1 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 1 4 2 4 4 2 1 1 2 4 1 2 3 2 2 2
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 4 1 1 3 3 1 3 3 4 1 1 1 2 2 1 3
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 3 3 3 4 2 1 1 3 1 3 1 1 2 4 3 4
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 4 3 2 2 2 4 4 3 1 4 3 4 1 2 4 2
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 2 4 3 1 4 4 3 1 1 1 3 2 4 3 2 2
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 2 3 4 3 3 3 3 4 3 1 1 3 4 2 1 2
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 2 4 3 4 3 2 1 1 1 3 2 3 1 4 2 1
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 1 3 1 4 2 4 4 4 4 1 2 3 3 2 2 3
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 3 1 4 3 4 2 4 2 3 1 1 1 2 4 1 2
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 3 1 2 2 2 2 2 2 1 4 4 2 2 4 4 1
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 2 1 2 1 1 2 2 4 1 3 4 3 3 4 3 1
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 2 2 3 1 3 3 3 1 4 4 3 1 3 2 2 1
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 1 1 2 2 3 1 1 1 1 1 1 3 4 4 3 4
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 1 2 3 1 2 1 3 2 2 3 4 3 1 1 3 2
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 2 1 1 3 4 4 2 3 2 2 2 4 3 2 3 4
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
## 2 3 4 2 3 1 2 2 4 4 3 3 1 3 2 4
## 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 2 4 4 2 3 1 2 1 1 2 3 3 1 2 4 3
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## 4 1 2 1 1 4 2 2 2 3 2 3 2 4 1 3
## 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 4 2 3 2 3 4 1 2 1 4 4 1 4 2 4 3
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 2 2 4 2 2 2 4 4 4 4 3 3 3 2 1 2
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## 4 3 1 2 2 4 3 4 2 3 4 1 2 2 1 2
## 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512
## 2 2 1 3 3 3 1 3 3 3 3 3 1 4 2 3
## 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528
## 4 4 3 2 3 1 2 2 2 4 3 4 3 1 1 1
## 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544
## 1 3 4 3 4 1 3 2 1 4 4 4 2 4 3 4
## 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560
## 1 2 3 1 1 1 1 4 2 1 1 4 2 4 4 1
## 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
## 4 2 1 4 1 4 4 1 3 2 4 3 1 3 1 3
## 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 2 2 1 2 2 1 4 1 1 3 3 4 1 3 4 2
## 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608
## 3 1 1 4 4 3 1 4 2 4 2 4 3 2 2 2
## 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
## 4 1 2 1 2 4 4 1 4 3 2 4 4 3 4 3
## 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640
## 3 4 2 1 1 1 2 1 1 2 3 2 3 3 3 2
## 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656
## 2 2 3 3 1 1 2 3 2 3 3 1 4 4 1 4
## 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672
## 2 4 3 4 3 4 3 4 4 4 2 4 4 4 4 1
## 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688
## 1 3 1 3 2 4 1 2 4 1 4 3 3 4 2 1
## 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704
## 2 4 1 2 4 1 4 4 1 2 3 4 3 4 3 3
## 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720
## 4 3 1 3 1 1 4 4 3 2 3 2 2 3 1 4
## 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736
## 4 4 1 1 2 2 2 4 4 2 1 3 1 2 4 4
## 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752
## 2 1 1 3 4 2 4 3 4 3 2 1 1 4 3 2
## 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768
## 3 1 4 2 2 4 2 4 4 2 1 3 2 2 3 4
## 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784
## 3 3 3 2 3 4 3 4 1 3 4 1 1 4 3 3
## 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800
## 3 2 3 4 2 2 4 1 4 3 3 3 2 2 3 4
## 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816
## 2 1 4 3 3 2 2 3 4 3 3 1 2 1 4 2
## 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832
## 1 4 2 3 4 4 1 4 4 3 1 1 2 2 3 1
## 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 2 3 3 2 4 2 3 4 4 1 4 4 3 4 1 1
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864
## 2 1 4 1 3 1 3 2 3 4 3 2 3 3 2 4
## 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880
## 3 3 1 4 2 2 1 4 2 4 3 3 4 2 4 3
## 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896
## 4 1 1 2 3 1 1 2 4 3 4 4 1 1 4 2
## 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912
## 2 4 3 2 2 1 1 4 3 3 1 4 2 3 4 4
## 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
## 2 3 1 2 4 2 3 2 1 2 2 2 1 4 2 3
## 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944
## 1 2 4 2 3 1 3 4 4 2 2 1 4 2 3 1
## 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960
## 1 3 1 1 4 1 1 1 1 1 4 2 4 4 4 2
## 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976
## 4 3 2 4 2 2 4 4 2 3 4 4 4 1 2 1
## 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992
## 4 1 4 3 2 3 1 3 2 3 1 3 4 1 4 2
## 993 994 995 996 997 998 999 1000
## 1 1 3 2 3 4 4 3
##
## Within cluster sum of squares by cluster:
## [1] 1648.259 1799.286 1739.876 1778.161
## (between_SS / total_SS = 22.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We can observe that four different clusters have been found with sizes 240 , 255 ,244 and 261. And the within cluster sum of square (WCSS) =22.5% which means that the cluster less compact and cohesive. Its higher than 2 clusters result which means 2 clusters are better in terms of compactness.
Cluster plot :
# 2- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeans_result, data = cdataset)
As we can see In the cluster plot, it’s evident that there are overlapping clusters.
#3-Average silhouette
library(cluster)
avg_sil <- silhouette(kmeans_result$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
## cluster size ave.sil.width
## 1 1 240 0.13
## 2 2 255 0.12
## 3 3 244 0.12
## 4 4 261 0.13
An Average Silhouette coefficient of 0.12 indicate that the clustering is not very well-defined, and there is ambiguity and overlap between clusters. However, the result is higher than 2 clusters.
# Cluster assignments and ground truth labels
cluster_assignments <- kmeans_result$cluster
ground_truth <- dataset$Risk
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
n <- length(cluster_assignments)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- cluster_assignments[i]
label <- ground_truth[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(cluster_assignments == cluster)
# Count the total number of items with the same category
total_same_category <- sum(ground_truth == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
precision <- precision_sum / n # Calculate average precision
recall <- recall_sum / n # Calculate average recall
return(list(precision = precision, recall = recall)) }
# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)
# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall
# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
## BCubed Precision: 0.336335
## BCubed Recall: 0.2729542
The calculated precision value is 0.336335 not a high value it mean the clusters are not pure.and not all data points in a cluster belong to the same category.
The calculated recall value is 0.2729542 it’s a low result meaning most of the data are not in the same cluster.
Conclusion of K=4:
After applying various evaluation metrics such as the average silhouette coefficient, within-cluster sum of squares ,Bcubed precision and recall.it became clear to us that k=4 Is not a good number of clusters since there is overlapping and the clusters are not pure .And the within cluster sum of square 4 clusters has a higher value than 2 cluster indicating that the 4 clusters less compact .but According to the number of class label its the best among the considered options.
# 2- prepreocessing
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)
# 3- run k-means clustering to find 8 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeansresult <- kmeans(cdataset,8)
# print the clusterng result
kmeansresult
## K-means clustering with 8 clusters of sizes 136, 149, 100, 132, 122, 93, 139, 129
##
## Cluster means:
## isMale isBlack isSmoker isDiabetic isHypertensive Age
## 1 0.6374557 0.9412256 0.96801163 -0.11758451 0.4803719 -0.42674815
## 2 0.9928563 0.1348064 -1.03201240 -0.06416429 0.3789569 -0.26292001
## 3 0.8197539 -0.2403129 -0.09200111 -0.06403000 -0.9895544 0.72578390
## 4 -0.9645589 0.5467726 -0.78958524 0.24399312 0.4189023 0.34795779
## 5 -0.6683239 -0.4868635 0.60735156 -0.12602627 -0.9895544 -0.76525097
## 6 -0.4852307 0.3598234 0.86048345 0.31098443 -0.7316060 0.75221709
## 7 -0.3324182 -0.2401689 -1.03201240 -0.23835629 -0.1410156 -0.05978714
## 8 -0.1272486 -1.0613821 0.96801163 0.14986867 1.0095454 0.08076629
## Systolic Cholesterol HDL
## 1 0.06575189 -0.11568304 -0.01941434
## 2 -0.44527267 -0.11897944 -0.04664307
## 3 -0.13781479 0.97280405 0.09080812
## 4 -0.67231955 0.46294127 0.02679507
## 5 -0.42367631 -0.08839685 -0.13312255
## 6 0.50518690 -0.83938009 0.32754430
## 7 1.08076911 -0.37828447 -0.03913655
## 8 0.11170739 0.12791071 -0.09153707
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2 7 1 1 8 8 5 3 8 7 3 7 2 1 2 6
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 2 1 1 7 5 3 8 1 2 2 4 2 8 3 7 4
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 2 1 6 2 6 2 3 3 6 5 3 7 4 3 5 5
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 2 6 1 1 4 5 8 7 6 5 3 4 5 8 8 2
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 1 4 2 1 6 2 5 1 8 6 4 4 3 2 1 6
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 6 6 4 4 2 2 7 2 5 7 5 6 4 8 1 8
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 4 2 5 8 7 3 4 1 5 7 7 5 4 5 5 6
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 6 3 4 7 5 2 6 8 3 7 6 2 2 1 7 2
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 8 4 4 7 3 1 4 4 6 8 2 5 7 2 6 1
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 3 8 3 3 5 8 1 1 2 1 8 6 4 7 5
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 2 1 8 1 1 8 4 2 5 6 4 5 7 8 8 5
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 1 2 4 2 3 2 3 7 5 2 4 7 8 8 4 7
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 2 3 7 3 8 2 7 2 7 2 4 2 8 1 4 1
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 6 7 3 8 8 1 6 2 7 1 7 1 4 6 1 8
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 8 3 7 2 1 3 7 4 7 2 7 8 6 7 8 8
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
## 8 4 1 4 7 4 2 1 3 2 2 7 5 8 4 8
## 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## 5 1 4 3 5 6 4 4 7 5 8 3 7 1 8 7
## 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## 4 7 2 5 8 6 1 5 6 2 8 3 5 5 5 4
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
## 2 2 6 7 1 8 1 8 7 4 4 7 6 1 2 5
## 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 7 7 8 5 8 3 5 8 7 6 1 3 8 1 1 4
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336
## 5 2 6 7 2 8 5 4 2 3 6 7 3 5 7 2
## 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
## 8 8 4 4 2 2 7 4 1 1 7 4 3 8 3 4
## 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368
## 2 7 5 8 4 7 4 4 2 2 2 3 1 1 2 4
## 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384
## 4 3 7 4 8 7 2 5 8 7 1 7 2 2 3 6
## 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 6 2 4 3 1 1 8 4 8 5 8 1 7 8 2 4
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
## 8 7 1 8 7 2 3 8 1 6 7 3 7 5 8 1
## 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## 5 1 4 8 2 4 8 4 2 8 7 4 4 5 3 7
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448
## 1 2 8 4 7 5 8 5 8 2 8 3 6 6 4 3
## 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
## 3 8 2 8 7 1 4 3 4 1 1 2 6 6 1 2
## 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 8 8 6 8 6 5 3 6 1 6 7 5 7 5 2 3
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496
## 5 4 7 8 5 1 6 6 8 2 1 2 8 5 2 8
## 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512
## 8 5 2 4 4 2 2 5 5 2 4 5 7 1 8 5
## 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528
## 6 6 6 8 3 7 8 6 6 4 5 4 2 7 4 2
## 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544
## 7 3 1 5 1 7 2 3 4 6 1 1 3 3 7 1
## 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560
## 2 3 2 7 2 4 4 6 3 7 7 1 8 5 4 7
## 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576
## 6 5 4 1 2 1 6 2 6 3 6 7 4 4 2 3
## 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592
## 3 8 7 8 8 4 6 4 2 2 2 4 4 7 6 8
## 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608
## 4 7 2 4 1 7 4 5 5 3 8 3 3 8 6 8
## 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624
## 1 2 5 4 3 1 6 2 1 3 8 6 1 3 4 7
## 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640
## 2 5 6 2 4 7 8 2 7 5 4 6 3 7 3 5
## 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656
## 8 5 7 7 7 2 5 7 5 2 5 2 6 1 7 1
## 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672
## 8 1 3 4 3 5 7 1 1 1 5 1 3 1 1 4
## 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688
## 4 3 4 3 8 5 2 5 1 4 1 7 2 5 6 2
## 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704
## 3 6 7 6 5 2 6 1 4 8 7 6 7 1 3 7
## 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720
## 6 7 4 4 2 7 1 1 7 5 7 6 8 5 7 6
## 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736
## 5 1 4 7 8 8 8 1 1 6 2 2 4 5 1 5
## 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752
## 8 4 2 4 6 5 1 2 1 2 5 4 2 5 3 8
## 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768
## 4 2 5 3 8 6 3 6 6 5 2 5 5 8 3 1
## 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784
## 3 7 3 8 7 6 2 1 4 7 5 2 4 1 3 2
## 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800
## 3 5 4 5 8 8 3 2 5 4 4 5 5 3 2 1
## 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816
## 3 2 1 5 7 8 8 2 5 7 2 4 6 2 6 3
## 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832
## 7 1 3 3 1 4 4 1 1 4 4 7 5 5 5 4
## 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848
## 5 4 7 8 1 5 5 5 5 2 5 1 3 4 7 2
## 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864
## 8 2 6 2 4 2 7 3 7 1 5 8 7 7 3 4
## 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880
## 3 3 4 6 8 5 2 6 5 1 2 7 6 8 1 7
## 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896
## 1 7 4 8 7 7 2 5 1 7 1 1 2 7 1 5
## 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912
## 6 4 7 5 8 2 4 1 2 7 2 1 8 3 1 1
## 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928
## 8 3 2 8 1 3 7 8 7 8 5 5 4 1 8 3
## 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944
## 2 3 5 8 4 2 4 6 3 8 6 4 1 5 7 2
## 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960
## 7 7 4 7 1 4 4 7 2 4 1 5 1 1 1 8
## 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976
## 6 2 3 1 5 8 1 1 5 3 6 1 1 2 5 7
## 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992
## 6 7 1 3 8 7 2 4 8 7 2 2 6 2 5 8
## 993 994 995 996 997 998 999 1000
## 2 2 7 8 6 1 6 7
##
## Within cluster sum of squares by cluster:
## [1] 797.7227 949.7918 641.1214 793.4957 737.4096 520.4053 918.1900 761.7654
## (between_SS / total_SS = 31.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
We can observe that the eight different clusters have been found with sizes 136, 149, 100, 132,93, 139 and 129 respectively, and the within cluster sum of square (WCSS) = 31.9%. which is higher than 2 and 4 clusters result which means 2,4 clusters are better in terms of compactness or homogeneity compared to the clustering result of 8 clusters.
Cluster Plot:
# 2- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeansresult, data = cdataset)
It’s clear that the eight clusters are overlapping.
#Average silhouette
library(cluster)
avg_sil <- silhouette(kmeansresult$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
## cluster size ave.sil.width
## 1 1 136 0.12
## 2 2 149 0.09
## 3 3 100 0.08
## 4 4 132 0.12
## 5 5 122 0.10
## 6 6 93 0.13
## 7 7 139 0.06
## 8 8 129 0.12
An Average Silhouette Coefficient of 0.1 indicates that, the clusters formed in the clustering process have some degree of similarity among their data points. However, the result is lower than 2 clusters which has silhouette coefficient average of 0.11 and also lower than K=4 clusters that is equal to 0.12.
# Cluster assignments and ground truth labels
cluster_assignments <- kmeansresult$cluster
ground_truth <- dataset$Risk
# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
n <- length(cluster_assignments)
precision_sum <- 0
recall_sum <- 0
for (i in 1:n) {
cluster <- cluster_assignments[i]
label <- ground_truth[i]
# Count the number of items from the same category within the same cluster
same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)
# Count the total number of items in the same cluster
total_same_cluster <- sum(cluster_assignments == cluster)
# Count the total number of items with the same category
total_same_category <- sum(ground_truth == label)
# Calculate precision and recall for the current item and add them to the sums
precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
recall_sum <- recall_sum + same_category_same_cluster / total_same_category
}
precision <- precision_sum / n # Calculate average precision
recall <- recall_sum / n # Calculate average recall
return(list(precision = precision, recall = recall)) }
# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)
# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall
# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
## BCubed Precision: 0.3747497
## BCubed Recall: 0.1554135
The calculated precision value is 0.37478 not a high value it mean the clusters are not pure.
The calculated recall value is 0.15541 it’s a low result meaning most of the data are not in the same cluster.
Conclusion of K=8:
Is not a good number of clusters especially when compared to the results obtained with K=2 and K=4 clusters. This conclusion is based on various evaluation metrics such as the average silhouette coefficient, within-cluster sum of squares, and Bcubed precision and recall. In all aspects, K=8 performed the worst. Additionally, considering the presence of class labels and our prior knowledge of the data set, we know the actual number of groups within the class label. So, by also taking this information into account, we can determine that K=8 is not an optimal number of clusters.
library(NbClust)
#a)fviz_nbclust() with silhouette method using library(factoextra)
fviz_nbclust(cdataset, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
#b) NbClust validation
fres.nbclust <- NbClust(cdataset, distance="euclidean", min.nc = 2, max.nc = 10, method="kmeans", index="all")
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning in log(det(P)/det(W)): NaNs produced
## Warning: did not converge in 10 iterations
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 3 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
According to the NbClust validation method, which utilizes the majority rule, the best number of clusters is 4. This number contradicts the initial suggestion from the silhouette method, which indicated that the best number of clusters is 2. However, upon revisiting the calculations and evaluating the performance, it is almost accurate to conclude that K=4 indeed performs the best among the considered options.